<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluating Large Language Models on Wikipedia Graph Navigation: Insights from the WikiGame</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Margiotta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danilo Croce</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Enterprise Engineering, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Via del Politecnico 1, 00133, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Reveal s.r.l.</institution>
          ,
          <addr-line>Via Kenia 21, 00144, Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) are believed to encode substantial structural and factual knowledge from resources such as Wikipedia, yet the extent to which they can exploit this internalized information for graph-based reasoning tasks remains unclear. We present a systematic evaluation of LLM navigation strategies in the context of the WikiGame, a task requiring players to reach a target Wikipedia page by traversing internal hyperlinks. We introduce a controlled experimental protocol that compares human and model performance across multiple settings, including both “blind” navigation (without access to outgoing links) and “link-aware” navigation (where available links are provided at each step). Using a large-scale dataset of human gameplay, we benchmark state-of-the-art LLMs (GPT-4, Llama 3.1) on identical start-goal pairs, measuring success rate, path efficiency, and error typologies. Our results show that while LLMs can match or surpass human accuracy under certain conditions, they exhibit qualitatively different strategies and characteristic failure modes, such as generating structurally invalid paths. Our findings highlight both the potential and the current limitations of LLMs in structured reasoning tasks, and propose a reproducible, game-based framework for assessing their ability to generalize beyond memorization.</p>
      </abstract>
      <kwd-group>
<kwd>WikiGame</kwd>
        <kwd>Wikipedia</kwd>
        <kwd>navigation</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>reasoning</kwd>
        <kwd>human-machine comparison</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>Large Language Models (LLMs) have demonstrated remarkable progress across a wide range of linguistic, reasoning, and knowledge-intensive tasks [<xref ref-type="bibr" rid="ref1">1, 2</xref>]. This progress is commonly attributed to pre-training on massive, web-scale corpora that include not only unstructured text, but also highly structured resources such as Wikipedia [3]. As a result, there is increasing speculation that LLMs may implicitly acquire not just isolated facts, but also the latent structure, the network of hyperlinks, conceptual proximity, and topological organization, of sources like Wikipedia [4].</p>
<p>However, it remains an open question what it truly means for an LLM to “internalize” a knowledge graph. Does the model simply memorize page-level facts and frequent co-occurrences, or does it develop an operational understanding of the underlying relational structure, enabling it to solve combinatorial navigation tasks that it has not directly memorized [3, 5]? Addressing these questions is essential for assessing the actual capabilities and limitations of LLMs, especially as they are increasingly applied in scenarios that require reasoning beyond surface-level retrieval.</p>
<p>In this work, we address these questions through the WikiGame (https://www.thewikigame.com/), also known as Wikispeedia [6]: a human-invented challenge where the objective is to navigate from a given Wikipedia start page to a target page, using only internal hyperlinks and as few clicks as possible. Crucially, success in the WikiGame is not a matter of simple recall: it requires sequential link selection, conceptual inference, and a practical understanding of the Wikipedia graph’s structure. Human players bring background knowledge, associative reasoning, and an ability to generalize; LLMs, in contrast, are tested on their capacity to replicate this process, whether via latent recall, combinatorial reasoning, or structural generalization.</p>
      <p>For example, consider the challenge of navigating from Germanium (a chemical element) to Rock (geology). While these concepts are related at a high level, Wikipedia’s hyperlink structure does not provide a direct or trivial path between them. A successful player must identify and traverse a plausible sequence of intermediate pages, such as: Germanium → Mineral → Earth’s crust → Rock (geology), avoiding shortcuts that may appear semantically valid but do not correspond to actual Wikipedia links. This task exemplifies the combinatorial complexity and the need for real structural knowledge, rather than rote memorization of facts.</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. ORCID: 0000-0001-9111-1950 (D. Croce); 0000-0001-5140-0694 (R. Basili). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
<p>To rigorously investigate these capacities, we construct a large-scale dataset of human WikiGame sessions (approximately 4,000 start-goal pairs), annotate them with empirical difficulty (success rate), and define a controlled evaluation framework spanning several experimental conditions. We benchmark two state-of-the-art LLMs (Llama 3.1 [7] and GPT-4 [2]) in three settings characterized by an increasing amount of information: (i) Blind Navigation, where the model is given only the names of the start and end pages and must generate a navigation path without any additional guidance; (ii) Chain-of-Thought Reasoning, where the model is asked to explicitly explain the rationale behind each navigational step [8]; and (iii) Link-Aware Navigation, where, at each step, the model is provided with the full list of outgoing links from the current Wikipedia page, thus closely simulating the experience and options available to a human player.</p>
<p>For each configuration, we assess not only overall success, such as path optimality, but also analyze failure modes, including invalid links and hallucinated pages. This allows us to explore whether LLM navigation relies on memorization, structural reasoning, or search-like strategies. While large models can match or exceed human performance in some settings, their errors often stem from structural hallucinations, revealing the limits between latent knowledge and true reasoning. Our work offers a reproducible benchmark and a diagnostic framework for evaluating how LLMs internalize knowledge graphs, with implications for model evaluation and the distinction between memorization and generalization.</p>
      <p>In the rest of the paper, we review related work in
Section 2, define the WikiGame task in Section 3, present
experiments and results in Section 4, and conclude with
key findings and future perspectives in Section 5.</p>
<p>Our Contribution. In contrast to previous research, our study offers a systematic comparison between humans and state-of-the-art LLMs on identical WikiGame challenges. By varying the information available to the models (blind vs. link-aware settings) and evaluating not only success rates but also the nature of errors (e.g., invalid links, hallucinated pages), we provide new insights into the mechanisms that underlie LLM navigation strategies. This framework enables us to directly probe the extent to which LLMs genuinely reason about Wikipedia’s structure versus relying on rote memorization or surface heuristics.</p>
      <p>2. Related Work</p>
      <p>LLMs as Knowledge Graph Navigators. The question of whether Large Language Models can serve as implicit knowledge bases [3], and, more deeply, whether they internalize the structural and relational properties of graph-based resources, has received increasing attention. While early benchmarks focused on factual recall or simple question answering [<xref ref-type="bibr" rid="ref1">3, 1</xref>], more recent work explores reasoning, pathfinding, and multi-hop navigation on graph-structured data.</p>
      <p>Navigation in Wikipedia and the WikiGame. Wikipedia, as a richly interlinked graph, has served as a challenging environment for both algorithmic agents and neural models. Zaheer et al. [4] train agents to imitate random walks on Wikipedia, showing that neural policies can learn to reach distant targets by leveraging graph regularities. However, their focus is on synthetic agent trajectories and does not systematically benchmark human or LLM strategies. Graph-based neural architectures such as Relational Graph Convolutional Networks have also been evaluated on multi-hop reasoning tasks over Wikipedia subgraphs [5], highlighting the importance of both symbolic and learned relational information for effective pathfinding. Synthetic data approaches [9] attempt to reproduce human navigation on Wikipedia, showing that clickstream-inspired trajectories can approximate real user behavior, but do not address the capacity of LLMs to navigate the graph or compare them directly to human performance. The WikiGame itself (and variants such as Wikispeedia [6]) has long been a benchmark for human semantic navigation, but only recently have researchers begun to systematically evaluate LLMs on this task.</p>
      <p>Generalization vs Memorization in LLMs. A core research question is whether LLMs’ strong performance on navigation reflects generalization from distributed knowledge or mere memorization of surface patterns and co-occurrences [3]. Prior work has highlighted both the strengths and limitations of LLMs in knowledge-intensive tasks, but comprehensive, human-comparable evaluation on graph navigation remains scarce.</p>
    </sec>
    <sec id="sec-2">
<title>3. WikiGame as a Probe for LLM Reasoning</title>
      <p>In this section, we formalize the WikiGame as a graph navigation task and motivate its value as a benchmark for large language models. We outline our experimental protocol for evaluating LLM reasoning under different information settings and introduce metrics to distinguish memorization, structural generalization, and explicit reasoning. These methodological choices establish a solid foundation for analyzing the strategies and limitations of both human and model-based Wikipedia navigation. All software and datasets are publicly available on GitHub at https://github.com/crux82/wikigame-llm-eval.</p>
      <p>3.1. From Encyclopedia to Graph: Formalizing Wikipedia Navigation</p>
      <p>Wikipedia can naturally be represented as a directed graph, where each node corresponds to an article and each directed edge to a hyperlink from one article to another. Formally, let 𝒢 = (𝒱, ℰ) denote the Wikipedia hyperlink graph, with 𝒱 the set of pages and ℰ the set of directed edges, such that (u, v) ∈ ℰ if v is hyperlinked within the text of u.</p>
      <p>Given this structure, the WikiGame can be formulated as a pathfinding problem: starting from a source node s (the Start page), the agent must reach a target node t (the End page) by traversing a sequence of nodes (v₀ = s, v₁, . . . , vₙ = t), such that each consecutive pair (vᵢ, vᵢ₊₁) corresponds to an existing edge in ℰ.</p>
      <p>The challenge lies not only in finding any path from s to t, but in selecting paths that are plausible and efficient, i.e., minimizing the number of steps, in line with typical game objectives and human strategies. At each step, the agent’s possible actions are constrained to the outgoing links from the current page, which (depending on the experimental condition) may or may not be explicitly visible to the agent.</p>
      <p>This formalization allows us to cast the WikiGame as a sequential decision-making problem over a partially observable and large-scale real-world graph. Crucially, success requires not only factual knowledge, but also structural reasoning and the ability to generalize over Wikipedia’s highly interconnected topology, making it a compelling testbed for both human and artificial agents.</p>
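<p>The formalization above maps directly onto standard graph code. The sketch below builds a tiny, hypothetical fragment of the hyperlink graph (an adjacency map, not the real Wikipedia dump) and recovers a shortest link path with breadth-first search:</p>

```python
from collections import deque

# Hypothetical toy fragment of the Wikipedia hyperlink graph, for
# illustration only: page title -> set of pages it links to.
GRAPH = {
    "Germanium": {"Mineral", "Chemical element"},
    "Chemical element": {"Mineral"},
    "Mineral": {"Earth's crust"},
    "Earth's crust": {"Rock (geology)"},
    "Rock (geology)": set(),
}

def shortest_path(graph, start, goal):
    """Breadth-first search: returns a minimum-length link path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(GRAPH, "Germanium", "Rock (geology)"))
```

On this toy graph the search recovers exactly the intermediate-page sequence used as the running example in the introduction.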
    </sec>
    <sec id="sec-3">
<title>3.2. Probing LLM Competence: Experimental Paradigms</title>
      <p>We evaluate LLMs under three progressively informative settings, each designed to probe a different aspect of their reasoning and navigation abilities.</p>
      <p>Blind Navigation (Direct Path Prediction). In the blind setting, the model is presented only with the titles of the start node (s) and end node (t), and is asked to output a plausible sequence of Wikipedia page titles forming a path from s to t. Crucially, at no step does the model observe the set of valid outgoing links from any node. This setting tests whether LLMs can retrieve or reconstruct complex multi-step relations from internalized knowledge, probing their ability to generalize, rather than simply recall isolated facts. Of particular interest here is whether errors reflect “hallucinated” nodes (page titles not present in Wikipedia) or “hallucinated” links (pairs of existing pages for which no hyperlink exists in the actual graph). Such distinctions shed light on whether the model’s apparent knowledge is structural or superficial. The precise prompt is in Appendix A.</p>
      <p>Blind Navigation with Chain-of-Thought Reasoning. This mode extends the previous setting by requiring the model to articulate, in natural language, the reasoning behind each navigational step. The sequence of justifications offers a window into the intermediate representations and planning strategies of the model, helping us distinguish whether successful paths arise from semantically-grounded reasoning or from statistical shortcuts. Moreover, Chain-of-Thought (CoT) supervision [8] enables us to quantify the impact of explicit reasoning on path quality and error rates. As before, the model is not exposed to outgoing links at any point. The prompt design for this condition is detailed in Appendix B.</p>
      <p>Link-Aware Navigation (Stepwise Choice). Finally, the link-aware mode simulates the actual gameplay experience: at each step, the model receives the set of outgoing links from the current node, and is requested to select the next node (page) to traverse. This setting directly tests the model’s ability to reason under stepwise constraints, avoid invalid transitions, and make locally grounded decisions. Notably, this scenario also allows for a direct comparison to human strategies, since the action space at each step is identical to what a player would see. Here, the primary sources of error are choices of suboptimal but valid links, and the rate of hallucinated steps should, in principle, be minimized. See Appendix C for the full prompt.</p>
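<p>The link-aware protocol reduces to a simple control loop. In the sketch below, choose_next is a stub standing in for the per-step LLM call; the greedy policy and the toy graph are hypothetical, for illustration only:</p>

```python
def navigate_link_aware(graph, start, goal, choose_next, max_steps=20):
    """Stepwise link-aware navigation: at each step the agent sees only
    the outgoing links of the current page and must pick one of them."""
    path = [start]
    current = start
    for _ in range(max_steps):
        if current == goal:
            return path
        options = sorted(graph.get(current, ()))
        if not options:
            return None          # dead end: no outgoing links
        current = choose_next(current, goal, options)
        if current not in options:
            return None          # invalid choice (impossible for a human player)
        path.append(current)
    return path if current == goal else None

# Stub policy standing in for the LLM (hypothetical): take the goal if it
# is directly linked, otherwise the first option alphabetically.
def greedy_stub(current, goal, options):
    return goal if goal in options else options[0]

toy = {"Germanium": {"Mineral"}, "Mineral": {"Rock (geology)"},
       "Rock (geology)": set()}
print(navigate_link_aware(toy, "Germanium", "Rock (geology)", greedy_stub))
# ['Germanium', 'Mineral', 'Rock (geology)']
```

Because the next page is always drawn from the visible options, hallucinated steps are ruled out by construction, mirroring the design of this experimental condition.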
    </sec>
    <sec id="sec-5">
<title>3.3. Evaluation Metrics: Dissecting Navigational Behavior</title>
      <p>To assess the navigation and reasoning abilities of LLMs in the WikiGame, we employ complementary evaluation metrics that capture different aspects of task performance, including memorization, generalization, and strategy.</p>
      <p>Success Rate. The most immediate measure is the success rate, defined as the proportion of WikiGame instances in which the agent (human or LLM, under a given strategy) successfully reaches the target node t starting from node s via a valid sequence of Wikipedia links. This metric provides a high-level view of navigational ability, aggregating all sources of error into a single outcome variable. High success rates in the blind setting, for instance, may indicate substantial memorization or internalized global structure, while improvements in CoT or link-aware settings can reveal the role of explicit reasoning or contextual cues. Contrasting success across these modes helps disentangle whether LLMs rely on static recall, reasoning over implicit knowledge, or dynamic use of available context.</p>
      <p>Mean Path Length (with Standard Deviation). Beyond mere task completion, we consider the efficiency of navigation. For all successful paths, we compute the average number of steps required to reach the goal, along with the standard deviation to capture variability across trials. Shorter average path lengths may suggest direct or globally informed strategies (possibly indicative of internalized conceptual proximity or shortcut-finding), while longer or more variable paths can reveal hesitancy, local search, or lack of structural insight. Comparing path lengths between humans and LLMs, and among settings, provides a window into differences in search strategy and the quality of graph representations.</p>
      <p>Invalid Link Rate. For model-based solutions, we compute the percentage of navigation attempts in which a transition is made between two existing Wikipedia pages, but the selected edge does not actually exist among the outgoing links of the current page (i.e., (vᵢ, vᵢ₊₁) ∉ ℰ, even though vᵢ₊₁ ∈ 𝒱). This error mode is critical for probing the distinction between true structural generalization and shallow recall: frequent invalid links imply that the model has learned about entities but not their actual connectivity, while low rates suggest a more faithful reconstruction of Wikipedia’s hyperlink topology. Notably, we expect invalid link errors to be most revealing in the blind setting, where the temptation to hallucinate plausible (but non-existent) transitions is highest.</p>
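<p>Aggregating success rate and mean path length over a set of games is straightforward; in the sketch below the per-game outcomes are hypothetical placeholders:</p>

```python
from statistics import mean, stdev

# Hypothetical per-game outcomes: (reached_goal, path_length_if_successful).
trials = [(True, 4), (True, 6), (False, None), (True, 5), (False, None)]

success_rate = sum(ok for ok, _ in trials) / len(trials)
lengths = [n for ok, n in trials if ok]   # lengths of successful paths only

print(f"success rate: {success_rate:.0%}")                         # 60%
print(f"path length: {mean(lengths):.1f} +/- {stdev(lengths):.1f} steps")
```

As in the metric definition, only successful paths contribute to the mean and standard deviation of path length.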
<p>Invalid Page Rate. Complementary to the above, we also measure the proportion of model-generated paths in which one or more nodes vᵢ do not correspond to any real Wikipedia page (vᵢ ∉ 𝒱). This captures a distinct failure mode (hallucination of nonexistent entities) which can arise from overgeneralization or semantic drift. Tracking this error across different strategies (e.g., whether it is reduced by explicit reasoning or by access to real links) informs our understanding of the interplay between LLM world knowledge and task-specific prompt structure.</p>
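<p>The two error types reduce to two set-membership tests against 𝒢 = (𝒱, ℰ). A minimal checker, over a hypothetical toy graph, might look like this:</p>

```python
def classify_path(graph, path, goal):
    """Check a proposed path against the hyperlink graph: report a
    hallucinated page, a hallucinated link, a miss, or a success."""
    for page in path:
        if page not in graph:
            return "invalid_page"      # node not in V
    for u, v in zip(path, path[1:]):
        if v not in graph[u]:
            return "invalid_link"      # both pages exist, edge not in E
    return "success" if path and path[-1] == goal else "failure"

TOY = {"Germanium": {"Mineral"}, "Mineral": {"Rock (geology)"},
       "Rock (geology)": set()}
# Both pages exist, but no direct Germanium -> Rock (geology) edge:
print(classify_path(TOY, ["Germanium", "Rock (geology)"], "Rock (geology)"))
# invalid_link
```

Note that the page check runs first: a path through a nonexistent page is counted as an invalid page rather than an invalid link, matching the distinction drawn above.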
      <sec id="sec-5-1">
        <title>4. Experimental Evaluation</title>
        <sec id="sec-5-1-1">
          <title>4.1. Experimental Setup</title>
<p>We begin by collecting a large corpus of human gameplay data from the public WikiGame platform (thewikigame.com), which assigns users random start-goal Wikipedia pairs and records navigation attempts, both successful and unsuccessful. Using a custom scraping tool, we continuously harvested game records over several weeks, yielding over 4,000 unique games, each annotated with start and target page, number of attempts, completion count, and aggregated success rate. This broad base allows for a detailed analysis of game difficulty: as shown in Figure 1, the distribution of human success is highly skewed, with only a handful of games approaching high completion rates and the majority posing a real challenge to human intuition and knowledge.</p>
          <p>Figure 1: Distribution of human success rates across the 4,000 collected WikiGame instances. Each bar shows the percentage of games whose completion rate by human players falls within the corresponding interval. Most games are far from trivial, with only a small fraction of tasks having high human success rates.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
<p>To ensure both representativeness and feasibility for LLM evaluation, we structured our experimental dataset by difficulty, grouping games based on their human success rate: Medium (50% ≤ success rate &lt; 75%), Hard (25% ≤ success rate &lt; 50%), Very Hard (1% ≤ success rate &lt; 25%), and Impossible (success rate = 0%). The Easy category (success rate &gt; 75%) was excluded, as it contained only 6 games. From each bin, we selected the 30 most-played games, resulting in a diverse set of 120 start-goal pairs that accurately reflects the real distribution of task difficulty, while keeping the evaluation manageable.</p>
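<p>The binning rule can be written down directly. This is a sketch: rates are percentages, and sub-1% rates (which the bin definitions above leave unassigned) are folded into Very Hard for simplicity:</p>

```python
def difficulty_bin(human_success_rate):
    """Map a game's human success rate (in percent) to its difficulty bin."""
    if human_success_rate == 0:
        return "Impossible"
    if human_success_rate >= 75:
        return "Easy"        # excluded from the benchmark (only 6 such games)
    if human_success_rate >= 50:
        return "Medium"
    if human_success_rate >= 25:
        return "Hard"
    return "Very Hard"       # rates from just above 0 up to 25

print([difficulty_bin(r) for r in (0, 10, 30, 60, 80)])
# ['Impossible', 'Very Hard', 'Hard', 'Medium', 'Easy']
```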
<p>For the model-based experiments, we selected a panel of LLMs that exemplifies the diversity of current architectures, scales, and access paradigms. Our evaluation includes the latest proprietary GPT-4 models accessed via the OpenAI API: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, and gpt-4o-mini, chosen to cover a spectrum from flagship large-scale models to compact and cost-efficient variants (see https://platform.openai.com/docs/models/gpt-4.1, https://platform.openai.com/docs/models/gpt-4.1-mini, https://platform.openai.com/docs/models/gpt-4.1-nano, and https://platform.openai.com/docs/models/gpt-4o-mini). For the open-weight evaluation, we used Meta’s Llama 3.1-8B-Instruct (https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct), deployed locally to ensure experimental control and reproducibility. This experimental design allows for direct comparison across proprietary versus open models, and across varying model sizes and training data coverage. The GPT-4 family was selected to probe the limits of proprietary large-scale models with extensive training on web data (including Wikipedia), while the Llama variant allows us to assess the capabilities of an open, smaller-scale architecture under more restricted computational resources (local GPU).</p>
      <p>As can be seen in Figure 2, all models were evaluated under the three experimental modes introduced in Section 3.2: Blind Navigation, Blind Navigation with Chain-of-Thought, and Link-Aware Navigation. For the OpenAI models, API calls were made with deterministic decoding (temperature = 0) to ensure reproducibility. For Llama, inference was run on local GPUs using a greedy decoding strategy, avoiding any probabilistic sampling and thus ensuring fully reproducible outputs. Notably, the Blind modes required a single API call per game, while the Link-Aware mode demanded one call per navigation step, increasing both API cost and computational resources, a key reason for limiting the test set to 120 games. All model outputs were automatically checked for structural validity (i.e., presence of only real Wikipedia page names and hyperlinks), with detailed error metrics collected as described in Section 3.3. To ensure transparency and reproducibility, all data, code, and prompt templates are publicly available.</p>
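<p>The cost asymmetry between the modes follows directly from the call pattern just described; the average step count below is a hypothetical figure used only to illustrate the arithmetic:</p>

```python
def api_calls(mode, n_games, avg_steps_per_game):
    """Blind and CoT modes need one call per game; Link-Aware needs one
    call per navigation step (avg_steps_per_game is hypothetical)."""
    return n_games * (1 if mode in ("blind", "cot") else avg_steps_per_game)

print(api_calls("blind", 120, 6))       # 120
print(api_calls("link_aware", 120, 6))  # 720
```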
<p>Figure 2: Success rates achieved by humans and different LLMs on the 120 WikiGame tasks, ordered by difficulty (based on human success rates), from easiest on the left to most difficult on the right, and by experimental mode. LLM performance is shown for the No Reasoning, Chain-of-Thought (CoT), and Link-aware conditions; the human baseline is reported for comparison.</p>
      <p>4.2. Results and Discussion</p>
<p>Comparing Human and Model Success. We present a comparative analysis of human participants and Large Language Models (LLMs) across the full set of WikiGame tasks, stratified by difficulty. As shown in Figure 1, human success rates decrease steadily as task difficulty increases, from approximately 56% on Medium tasks to 0% on the Impossible category. Looking instead at Figure 2, in the Blind settings (No Reasoning and Chain-of-Thought), only the largest model (GPT-4.1) approaches or matches human performance, particularly on the less difficult games. When models are provided with explicit link information (Link-aware mode), success rates increase dramatically for all GPT-based models, with GPT-4.1 achieving perfect accuracy (100%) even on the hardest tasks. In contrast, smaller models and Llama 3.1-8B exhibit lower overall performance and are especially challenged as difficulty rises.</p>
      <p>Table 1 details these trends, confirming the strong advantage of large-scale LLMs when given access to link structure, and quantifying the performance gap between model families and sizes, as well as with respect to humans. Notably, large models like GPT-4.1 nearly match or even exceed human accuracy on medium and hard games, even when required to hallucinate plausible paths without structural information, demonstrating substantial internalized knowledge of Wikipedia’s structure. However, this capacity rapidly diminishes for smaller models and Llama 3.1-8B, underscoring the importance of both scale and training diversity for generalization in this combinatorial setting. The Link-aware condition reveals that explicit access to local structure allows even smaller models to become more competitive, and often enables large LLMs to outperform humans on the most difficult tasks. These results highlight that while large LLMs internalize part of Wikipedia’s global structure, their ability to generalize without explicit context remains limited; access to structural cues is critical for bridging the gap between memorization and robust reasoning.</p>
      <p>Remarkably, GPT-4.1 achieves success rates in the Blind setting that are nearly indistinguishable from those of human players, despite not having access to the outgoing links at each step, an advantage always available to humans. This surprising alignment suggests that GPT-4.1 has internalized a substantial portion of Wikipedia’s structure, likely as a result of large-scale pretraining. Such performance raises the question of whether these models are simply memorizing large parts of Wikipedia’s link graph or have developed more generalizable strategies for navigation. In any case, the fact that a model can solve the task as well as humans, even when deprived of crucial contextual information, highlights both the strengths and the unresolved boundaries of current LLM capabilities. (Links to the dataset and code repository will be provided after acceptance.)</p>
<p>Error Analysis: Invalid Links and Hallucinated Pages. We further analyze model behavior by quantifying two principal categories of structural error: invalid links, i.e., transitions between real Wikipedia pages that are not connected in the actual hyperlink graph, and invalid (hallucinated) pages, i.e., nodes that do not exist in Wikipedia.</p>
<p>Tables 2 and 3 summarize the error rates for all models, difficulty levels, and information settings.</p>
<p>Table 1: Detailed success rates (%) of each model and human participants, across all difficulty categories and experimental settings (No Reasoning, Chain-of-Thought, Link-aware).</p>
      <p>Table 2: Invalid Link Rate: percentage of navigation paths containing at least one transition between two existing Wikipedia pages for which no hyperlink actually exists. Results are reported for all models, difficulty bins, and experimental modes.</p>
      <p>Invalid links represent the dominant failure mode across all models, especially in the Blind and Chain-of-Thought (CoT) conditions. Here, smaller models such as GPT-4.1-nano and Llama 3.1-8B often exceed 70–80% invalid link rates, while the best-performing model (GPT-4.1) remains substantially lower but is still affected by increasing task difficulty. Interestingly, generating explicit reasoning with CoT prompts only marginally reduces invalid link errors, and in some cases may even exacerbate them, suggesting that stepwise justifications do not systematically enhance structural fidelity.</p>
<p>By contrast, providing local link information (Link-aware mode) yields dramatic improvements for all GPT-based models, with invalid link rates dropping to near zero in most settings, regardless of difficulty. This highlights the centrality of explicit structural cues for accurate graph traversal. Notably, Llama 3.1-8B still struggles with invalid links even in the Link-aware setting, indicating architectural and training limitations not overcome by local information alone.</p>
      <p>The generation of nonexistent Wikipedia pages is a less frequent, but still important, error type. Invalid page rates remain below 10% for most models and settings, with higher incidences concentrated among smaller models and in the most challenging tasks. The GPT-4 family is notably conservative, rarely hallucinating new pages, while Llama 3.1-8B and smaller GPT variants are more prone to this error, particularly under Blind conditions.</p>
<p>CoT reasoning occasionally increases invalid page rates, perhaps reflecting a tendency toward overgeneration in less robust models. Together, these results illustrate that the core challenge for LLMs in blind navigation is not the invention of entirely new entities, but rather the generation of plausible-yet-nonexistent links between real Wikipedia pages. Invalid link rates are highly sensitive to both model scale and the availability of local context, whereas invalid page rates remain a secondary but informative indicator of robustness. The error patterns reinforce that, while large LLMs have internalized significant aspects of Wikipedia’s structure, their global knowledge is incomplete and patchy, most evident when explicit structural feedback is absent.</p>
      <p>Navigation Efficiency. Table 4 reports the average path lengths for each model and human participants across task difficulty and experimental mode, revealing a marked distinction in navigation efficiency. In both the Blind (No Reasoning) and Chain-of-Thought (CoT) settings, all language models produce navigation paths that are, on average, substantially shorter than those of human players. For instance, on Medium and Hard tasks, humans typically require around 5.5 and 6.6 steps respectively, whereas top-performing LLMs such as GPT-4.1 solve the same tasks in just 3–4 steps. This pattern suggests that, when unconstrained by real hyperlink options, LLMs tend to “jump” directly to the goal, likely exploiting their internal representations of semantic relatedness and making aggressive, shortcut-like connections not accessible to humans.</p>
      <p>In contrast, when models are placed in the Link-aware mode (where only valid outgoing links are visible at each step), average path lengths increase and can even approach or exceed human averages, particularly for more difficult games. This shift reflects a more conservative and locally grounded navigation style: restricted to real options, models avoid risky or speculative moves and instead opt for safer, if longer, paths. The difference is especially evident in smaller models (e.g., Llama 3.1-8B), which show much greater variance and, in some cases, excessively long solutions as task complexity grows.</p>
      <p>Interestingly, while shorter paths might seem opti- tion of Wikipedia’s “semantic landscape” that is broad
mal, this eficiency in Blind settings often arises from but incomplete, with brittle spots where the true
hyperthe use of invalid or hallucinated links, as indicated in link structure diverges from distributional similarity. A
our previous error analysis. By contrast, the slightly further finding concerns the limits of Chain-of-Thought
longer paths produced in Link-aware mode are typically prompting in structurally constrained tasks. While
vermore faithful to Wikipedia’s structure, and thus better balized reasoning can support performance on factoid
reflect human-like and valid solutions. Consequently, or arithmetic challenges, in navigation it sometimes
enpath length should always be interpreted alongside error courages overgeneration or speculative shortcuts,
highrates: eficiency alone does not guarantee correctness, lighting the limits of purely linguistic supervision for
and valid navigation sometimes demands a willingness inherently graph-based reasoning problems.
to take longer, but legal, routes through the graph. A further nuance emerges from our error analysis: not
all invalid links proposed by LLMs are necessarily
misKey Insights and Open Challenges. Beyond quanti- takes in a semantic sense. In several cases, especially in
tative gains, our study reveals several less obvious but cru- the Blind navigation setting, the models generate
transicial insights into LLM navigation and reasoning. Larger tions between pages that are not currently hyperlinked
GPT models, by virtue of scale and pretraining diversity, in Wikipedia, but which would be both meaningful and
are able to recombine fragments of Wikipedia knowl- contextually appropriate. This phenomenon highlights
edge into plausible multi-step paths, even when direct a subtle limitation of the evaluation protocol itself: the
supervision for these specific routes is unlikely. This Wikipedia graph, while vast, is not exhaustive, and may
compositional ability is especially evident in challenging omit reasonable connections that a knowledgeable agent
settings, where models often leverage high-trafic "hub" could plausibly infer. Consequently, some LLM
“hallucipages as implicit waypoints—a behavior rarely observed nations” may in fact surface gaps in the existing
knowledge structure rather than true model failures. This
ambiguity complicates the strict interpretation of invalid link
rates: high-performing models may occasionally reveal
“missing links” that reflect creative generalization rather
than simple error.</p>
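To make the joint reading of efficiency and correctness concrete, the following sketch computes the validity rate together with the mean path length restricted to structurally valid paths. The record layout and sample values are illustrative, not our experimental data.

```python
# Joint view of efficiency and correctness: mean path length computed only
# over structurally valid paths, alongside the validity rate.
from statistics import mean

def summarize(records):
    """records: list of dicts with 'length' (number of hops) and 'valid' (bool)."""
    valid = [r for r in records if r["valid"]]
    return {
        "validity_rate": len(valid) / len(records),
        "mean_length_valid": mean(r["length"] for r in valid) if valid else None,
        "mean_length_all": mean(r["length"] for r in records),
    }

# Illustrative runs: a short path may still be invalid (hallucinated link).
runs = [
    {"length": 3, "valid": False},  # short but uses a nonexistent hyperlink
    {"length": 3, "valid": True},
    {"length": 6, "valid": True},
]
print(summarize(runs))
```

Reporting mean length only over valid paths avoids rewarding the "shortcut" behavior discussed above.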
      <p>We acknowledge financial support from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU, and support from Project ECS 0000024 Rome Technopole - CUP B83C22002820006, NRRP Mission 4 Component 2 Investment 1.5, funded by the European Union - NextGenerationEU.</p>
      <sec id="sec-6-1">
        <title>A. Prompt: Blind - No Reasoning</title>
        <p>This prompt instructs the model to generate a direct navigation path from a given start Wikipedia page to a target page, using as few steps as possible. The model must output only the sequence of page titles (separated by "-&gt;"), each title standing in for the link that would be clicked, with no explanation or reasoning, simulating the most basic WikiGame navigation scenario without any access to outgoing links or intermediate guidance.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
The user will provide a Start_Node and an End_Node.</p>
        <p>You must generate a path from the start to the end, trying to use the fewest possible link hops.</p>
        <p>Do not explain anything.</p>
        <p>The only output should be:
- A line containing ###
- A single line with the names of the pages in the path, separated by -&gt; (e.g., Page1 -&gt; Page2 -&gt; Page3)
Expected output format:
###
Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4
Important:
- Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4 is only an example of the output format; do not use it as the solution
- Write only the page titles separated by -&gt;.
- Do not include any reasoning or explanation.
- Do not write anything before or after the final line.</p>
        <p>- Start your output with ### on a line by itself.</p>
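A response in this format can be checked mechanically. The sketch below (the function name and sample string are ours, not part of the experimental pipeline) extracts the page sequence from the line following the ### delimiter:

```python
def parse_blind_output(text: str) -> list[str]:
    """Extract the page sequence from the expected output format:
    a line containing ### followed by 'Page1 -> Page2 -> ...'."""
    marker = "###"
    if marker not in text:
        raise ValueError("missing ### delimiter")
    tail = text.split(marker, 1)[1].strip()
    # The path is the first non-empty line after the delimiter.
    path_line = tail.splitlines()[0]
    return [title.strip() for title in path_line.split("->")]

out = "###\nGermanium -> Metalloid -> Silicon"
print(parse_blind_output(out))  # ['Germanium', 'Metalloid', 'Silicon']
```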
      </sec>
      <sec id="sec-6-2">
        <title>B. Prompt: Blind - Reasoning</title>
        <p>This prompt requires the model to solve the WikiGame navigation task while explicitly articulating the reasoning
behind each step. At every hop, the model must briefly explain its choice, and only after completing the path, output
the full solution as a sequence of page titles. This setting aims to probe the model’s internal reasoning process and
to assess whether explanation improves path validity or plausibility.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
Solve the path from the Start_Node to the End_Node using as few steps as possible.</p>
        <p>At each step, you must explain why you’re clicking on the chosen link.</p>
        <p>Once you’ve reached the destination, write the full path using -&gt; between page names.</p>
        <p>Instructions:
You will be given two page names: Start_Node and End_Node.</p>
        <p>Starting from Start_Node, find a path to reach End_Node.</p>
        <p>At each step, explain briefly why you’re choosing that link.</p>
        <p>When you reach the destination:
- First, think through an Explanation of how to reach the End_Node from the Start_Node
- Then write a line with just ###
- Finally write the full path as a list of link names separated by -&gt;
- Do not include any text before or after the final path
Important:
- Do not skip the ### line before the full path.
- Do not add explanations after the ### section.
- The final line must contain only Wikipedia page titles separated by -&gt;, nothing else.
- The final line must contain all the page titles, in the order chosen during the Explanation.
- The final line must start with the Start_Node and finish with the End_Node (without explanation or suffix)
Expected output format:
Explanation:
1. I start at "Page 1" (Start_Node) and click on "Page 2" because ...
2. From "Page 2", I click on "Page 3" because ...
3. From "Page 3", I go to "Page 4" (End_Node) which is the final goal because ...
###</p>
        <p>Page1 -&gt; Page2 -&gt; Page3 -&gt; Page4</p>
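The explanation and the final path in this format can be separated by splitting on the last ### delimiter. The helper name and sample response below are illustrative, not the paper's actual tooling:

```python
def parse_reasoning_output(text: str) -> tuple:
    """Split a CoT response into (explanation, path): the path is the
    first non-empty line after the last ### delimiter."""
    explanation, _, tail = text.rpartition("###")
    lines = [line for line in tail.splitlines() if line.strip()]
    if not lines:
        raise ValueError("no path after ### delimiter")
    path = [t.strip() for t in lines[0].split("->")]
    return explanation.strip(), path

resp = (
    "Explanation:\n"
    '1. I start at "Clock" and click on "Time" because ...\n'
    "###\n"
    "Clock -> Time -> Computing"
)
expl, path = parse_reasoning_output(resp)
print(path)  # ['Clock', 'Time', 'Computing']
```

Splitting on the last delimiter tolerates a ### appearing accidentally inside the reasoning text.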
      </sec>
      <sec id="sec-6-3">
        <title>C. Prompt: Link-Aware</title>
        <p>In this prompt, the model is presented at each step with the explicit list of outgoing links from the current Wikipedia page and must choose one to move closer to the target page. No reasoning or explanation is required (only the chosen page name is output), thus closely mimicking the human decision process in an actual WikiGame session with visible navigation options. This mode directly tests the model’s ability to select valid and effective next steps when provided with local link context.</p>
        <p>How to play:
A start page and an end page on Wikipedia are selected. These can be chosen randomly or decided by the players.
Starting from the Start_Node, you must click only on internal links found within the main body of the article to reach the End_Node.</p>
        <p>Your task:
The user will provide a Start_Node, an End_Node, and a List_Link_From_Start_Node: a list of page names linked from the Start_Node.</p>
        <p>You must make a single choice of one page name from those proposed in List_Link_From_Start_Node; the page you choose must bring you as close as possible to the End_Node.</p>
        <p>At every step, make the choice that best moves you toward the End_Node.</p>
        <p>Do not explain anything.</p>
        <p>The only output should be:
- A line containing ###
- The unique page name choice, only one from the list List_Link_From_Start_Node
- A final line containing @@@
Very Important Instructions:
- Write only the chosen page title.
- You must choose the page from the List_Link_From_Start_Node.
- Do not include any reasoning or explanation.
- Start your output with ### on a line by itself.
- After the page name choice, write a last line with @@@.
- Do not write the same page name as the Start_Node; you will lose.
- Do not write a page name that is not in the List_Link_From_Start_Node.
- Do not change the case of the page name; write it exactly as it appears in the List_Link_From_Start_Node.</p>
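The Link-aware protocol can be sketched as a driver loop that queries a chooser (standing in for the LLM call at each step) and rejects any choice outside the current link list. The graph fragment and the greedy chooser below are illustrative only:

```python
# Driver loop for the Link-aware setting: at each step the agent must pick
# exactly one page from the current page's outgoing links.
def navigate(start, goal, links, choose, max_steps=20):
    current, path = start, [start]
    for _ in range(max_steps):
        if current == goal:
            return path
        options = links.get(current, set())
        nxt = choose(current, goal, options)
        if nxt not in options:  # reject out-of-list (invalid) choices
            raise ValueError(f"{nxt!r} is not a link on {current!r}")
        current = nxt
        path.append(nxt)
    return None  # step budget exhausted

# Illustrative graph fragment; real runs query Wikipedia's link structure.
graph = {
    "Clock": {"Time"},
    "Time": {"Measurement", "Computing"},
}
greedy = lambda cur, goal, opts: goal if goal in opts else sorted(opts)[0]
print(navigate("Clock", "Computing", graph, greedy))  # ['Clock', 'Time', 'Computing']
```

The membership check mirrors the prompt's constraint that the model "must choose the page from the List_Link_From_Start_Node".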
      </sec>
      <sec id="sec-6-4">
        <title>D. Error Analysis</title>
        <p>To illustrate typical model errors and their underlying causes, we present a qualitative analysis of failed navigation
attempts in the Blind settings (No Reasoning and CoT), focusing on the most frequent error type: Invalid Link, where
a transition is generated between two existing Wikipedia pages, but the corresponding hyperlink does not exist.
Case 1: Semantic Plausibility without Structural Support
Task: Medium difficulty, gpt-4o-mini (Blind - No Reasoning)
Start: Germanium End: Rock_(geology)
Generated Path: Germanium → Metalloid → Silicon → Rock_(geology)
Error: No link from Silicon to Rock_(geology).</p>
        <p>While Silicon and Rock_(geology) are closely related semantically, the Wikipedia page for Silicon does not
link directly to Rock_(geology). In contrast, human players typically reach the target via longer, structurally
valid paths, e.g., through Mineral, Earth’s crust, or Solid.</p>
        <p>Case 2: Link Hallucination from Mentioned but Unlinked Entities
Task: Very Hard, gpt-4.1 (Blind - Reasoning)
Start: Clock End: Computing
Generated Path: Clock → Computer → Computing
Error: No link from Clock to Computer.</p>
        <p>The model identifies Computer as conceptually relevant (and mentioned in the Clock article text), but this mention
is not a hyperlink. Human solutions tend to traverse more granular technical or historical intermediates, yielding
longer but valid paths.</p>
        <p>Case 3: Overgeneralization of Conceptual Connections
Task: Very Hard, gpt-4o-mini (Blind - Reasoning)
Start: Clock End: Computing
Generated Path: Clock → Time → Measurement → Computing
Error: No link from Measurement to Computing.</p>
        <p>Computing is present as a term within Measurement, but not as a direct hyperlink. The LLM overgeneralizes the
apparent connection, skipping intermediate concepts that human players usually include.</p>
        <p>Case 4: Surface Similarity versus Structural Reality
Task: Medium, gpt-4o-mini (Blind - No Reasoning)
Start: Diesel_engine End: Electric_charge
Generated Path: Diesel_engine → Internal_combustion_engine → Electric_vehicle →
Electric_charge
Error: No link from Electric_vehicle to Electric_charge.</p>
        <p>Although Electric_vehicle is strongly associated with Electric_charge in meaning, Wikipedia’s link
structure does not provide a direct connection. Human players reliably reach the target via technical or physical
intermediates such as Spark plug, Electric current, or Piezoelectricity.</p>
        <p>Summary. Across these cases, LLMs display a tendency to infer links based on high-level conceptual associations
or textual mentions rather than strictly adhering to Wikipedia’s hyperlink structure. This behavior is particularly
evident in Blind settings, where models must rely on internalized world knowledge. In contrast, human players
favor longer but structurally valid paths. These examples highlight a key challenge for LLM-based graph navigation:
distinguishing plausible but invalid shortcuts from topologically feasible solutions.</p>
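The invalid-link versus invalid-page typology used in these cases can be operationalized as a hop-by-hop check of a generated path against a snapshot of the link graph. The following sketch uses a toy graph fragment mirroring Case 1; a page is taken to exist if and only if it appears as a key in the snapshot:

```python
# Classify each hop of a generated path against a (partial) snapshot of
# Wikipedia's link graph. Both the graph and the path are illustrative.
from typing import Dict, List, Set

def classify_path(path: List[str], links: Dict[str, Set[str]]) -> List[str]:
    """Label each hop as 'valid', 'invalid_link' (both pages exist but no
    hyperlink connects them), or 'invalid_page' (target page does not exist)."""
    labels = []
    for src, dst in zip(path, path[1:]):
        if dst not in links:                    # hallucinated, nonexistent page
            labels.append("invalid_page")
        elif dst not in links.get(src, set()):  # real page, missing hyperlink
            labels.append("invalid_link")
        else:
            labels.append("valid")
    return labels

# Toy fragment reproducing Case 1: no direct link Silicon -> Rock_(geology).
graph = {
    "Germanium": {"Metalloid"},
    "Metalloid": {"Silicon"},
    "Silicon": {"Mineral"},
    "Mineral": {"Rock_(geology)"},
    "Rock_(geology)": set(),
}
print(classify_path(["Germanium", "Metalloid", "Silicon", "Rock_(geology)"], graph))
# ['valid', 'valid', 'invalid_link']
```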
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
      </sec>
    </sec>
  </body>
  <back>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>Our study has several limitations: the range of start-goal pairs and models is constrained by cost, and our metrics focus on structural correctness rather than semantic nuance or user experience. Expanding to additional architectures, larger model dimensions, multilingual Wikis, or richer evaluation criteria represents important future work. In summary, LLMs show strong but imperfect generalization beyond memorization, with qualitative strategy differences persisting relative to humans. Future research should probe broader model families, alternative domains, and hybrid approaches that combine LLM reasoning with explicit graph traversal, as well as deeper comparisons of human and model navigation strategies.</p>
    </sec>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, et al., Language models are few-shot learners, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, et al. (Eds.), Advances in Neural Information Processing Systems, volume 33, 2020.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] I. Staliūnaitė, P. J. Gorinski, I. Iacobacci, Relational graph convolutional neural networks for multihop reasoning: A comparative study, arXiv preprint arXiv:2210.06418 (2022).</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. West, J. Pineau, D. Precup, Wikispeedia: An online game for inferring semantic distances between concepts, in: IJCAI, volume 9, 2009, pp. 1598-1603.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. G. et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. H. Chi, Q. Le, D. Zhou, Chain of thought prompting elicits reasoning in large language models, CoRR abs/2201.11903 (2022). URL: https://arxiv.org/abs/2201.11903. arXiv:2201.11903.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] A. Arora, M. Gerlach, T. Piccardi, A. García-Durán, et al., in: Proceedings of the ACM International Conference on Web Search and Data Mining, WSDM ’22, ACM, 2022, pp. 16-26. doi:10.1145/3488560.3498496.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>