Evidence for Systematic Bias in the Spatial Memory of Large Language Models

Nir Fulman1,∗,†, Abdulkadir Memduhoğlu1,2,† and Alexander Zipf1,3
1 GIScience Chair, Institute of Geography, Heidelberg University, Heidelberg, Germany
2 Department of Geomatic Engineering, Faculty of Engineering, Harran University, Sanliurfa, Türkiye
3 HeiGIT at Heidelberg University, Heidelberg, Germany

Abstract
We report our initial findings from an examination of potential systematic bias in the spatial reasoning capabilities of Large Language Models (LLMs). We devised a series of questions to probe the spatial reasoning abilities of four LLMs: GPT-3.5, GPT-4, Gemini, and LLaMA-2, targeting four specific biases rooted in human spatial perception: hierarchical, proximity, rotation, and alignment biases. The questions encompassed scenarios challenging the models' spatial reasoning, and each question was posed 10 times independently to gauge the consistency of the LLMs' responses. The models demonstrated a strong understanding of straightforward geographical relationships, achieving 87% accuracy on questions that did not challenge biases in spatial reasoning. However, when faced with questions highlighting these biases, the models' accuracy dropped to 24%. We discuss the design of a large-scale experiment aimed at examining spatial cognition biases in large language models and identifying potential mitigation strategies.

Keywords
geographic reasoning, large language models, cognitive maps, systematic bias

GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, Glasgow, Scotland
∗ Corresponding author. † These authors contributed equally.
nir.fulman@uni-heidelberg.de (N. Fulman); memduhoglu@uni-heidelberg.de (A. Memduhoğlu); zipf@uni-heidelberg.de (A. Zipf)
ORCID: 0000-0002-2629-2641 (N. Fulman); 0000-0002-9072-869X (A. Memduhoğlu); 0000-0003-4916-9838 (A. Zipf)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Recent studies primarily view Large Language Models (LLMs) in geography as tools linking natural language to geographic information systems [1]. However, Roberts et al. [2] showcased GPT-4's [3] inherent ability to perform spatial reasoning tasks. They highlighted tasks that extend beyond mere recall of factual information, namely GPT-4's proficiency in calculating the final destinations of routes based on initial locations, modes of transport, directions, and travel durations, without reliance on external processing engines. These capabilities open up practical applications such as creating personalized travel itineraries. Identifying the weaknesses of LLMs in spatial tasks may help guide their development in this direction.

We investigate the possibility that biases in human spatial reasoning may manifest in LLMs, focusing on four well-studied ones: (a) Hierarchical bias refers to the cognitive tendency to infer the direction between two points from the dominant geographical orientation of their larger categorical groups (such as states or regions), leading to inaccuracies when exceptions to these general orientations exist [4]. (b) Proximity bias refers to the tendency to underestimate distances within the same categorical group, perceiving them as shorter than distances between different groups, even when the distances across groups are actually shorter [5].
(c) Rotation bias refers to the tendency to adjust the mental representation of geographical elements, aligning them more closely with conventional cardinal directions than their actual orientations [6]. This simplification leads to misconceptions about the true positions of locations, as individuals mentally 'rotate' geographical layouts to fit a more straightforward north-south/east-west alignment, irrespective of their true, more complex orientations. (d) Alignment bias refers to the cognitive inclination to overestimate the alignment of geographically grouped locations, leading to skewed perceptions of their actual latitudinal or longitudinal relationships [7].

LLMs rely on associative learning and contextual data processing to understand and generate human-like text [8]. These models may manifest systematic biases in spatial reasoning due to their training data and learning mechanisms. While human biases in spatial reasoning are rooted in mental mapping [7], which LLMs do not possess, we hypothesize that they may exhibit similar biases, based on three considerations. First, LLMs learn from textual data, and biases in human spatial reasoning can be present in textual descriptions of geography. Second, just as humans generalize and simplify in their cognitive maps, leading to biases, they also do so in their textual descriptions of locations; LLMs, learning from such descriptions, could inherit and perpetuate these biases in their spatial reasoning abilities. For example, the US is often described as south of Canada, despite some areas within the US lying further north than parts of Canada. Third, LLMs might prioritize conceptual associations, such as assuming a 'west coast' city to be the westernmost point without accounting for the curvature of the coastline.

To investigate hierarchical bias in LLMs, we initially conducted a study with ten questions: five challenging questions, based on scenarios where humans typically struggle and where bias is therefore likely to be exhibited, and five control questions serving as a baseline for comparison. This study included four models: GPT-3.5 [3], GPT-4, LLaMA 2 [9], and Gemini 1.0 Pro [10], among which GPT-4 demonstrated superior performance. Based on this outcome, we narrowed our focus to GPT-4 for an analysis of the other types of bias. For each type, we formulated four questions, maintaining a balance between challenging and control scenarios. Each question in our study was posed ten times, each time in a fresh 'zero-shot' session, so that responses were not influenced by previous interactions. The questions were directly drawn from or inspired by well-known experiments in the cognitive psychology literature, as referenced below. This paper extends the work of Fulman et al. [11], who provided evidence of hierarchical bias in LLMs.
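A minimal sketch of this querying protocol is shown below: each prompt is sent as a fresh, zero-shot request and repeated ten times. It assumes the OpenAI Python client; the model name, prompt template, and city pairs are illustrative placeholders rather than the exact configuration used in the study.

```python
# Minimal sketch of the zero-shot querying protocol: each question is posed
# ten times, each time in a new request with no prior messages, so earlier
# answers cannot influence the response. Model name and city pairs are
# illustrative placeholders, not the study's exact setup.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = "What is the intercardinal direction from {a} to {b}?"
CITY_PAIRS = [("Portland, OR", "Toronto, Canada"),
              ("Dallas, TX", "San Antonio, TX")]
REPEATS = 10  # each question is posed ten times

def ask_once(question: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content.strip()

for a, b in CITY_PAIRS:
    question = PROMPT.format(a=a, b=b)
    answers = Counter(ask_once(question) for _ in range(REPEATS))
    print(question, dict(answers))
```

In practice the free-text answers would still need to be normalized (for example, mapped onto one of the four intercardinal directions) before being scored against ground truth.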
2. Results

The outcomes for hierarchical bias (a) are illustrated in Table 1. The models were instructed to determine intercardinal directions between cities, using the prompt: 'What is the intercardinal direction from [City A] to [City B]?' For example, all models consistently (0/10) provided inaccurate directions between Portland and Toronto. We attribute the error to the general northward alignment of Canada relative to the United States (Figure 1a). This observation is in line with the findings of Stevens and Coupe [4], who observed a similar misperception in humans, presumably influenced by the overarching southward position of the United States relative to Canada, leading most people to incorrectly assume that Toronto is north of Portland. Conversely, when assessing the relationship between Dallas and San Antonio, both in Texas, the models consistently provided the correct answer (10/10).

GPT-4 demonstrates the highest accuracy in this assessment, achieving a 75% success rate, followed by Gemini with 55%, GPT-3.5 with 53%, and LLaMA-2 with 47%. In scenarios designed to highlight hierarchical bias, GPT-4 distinctly outperforms its counterparts, registering a 50% accuracy rate; in comparison, Gemini scores 34%, GPT-3.5 26%, and LLaMA-2 only 10%. However, when evaluating tasks absent of suspected hierarchical bias, all models exhibit improved performance, with accuracy rates exceeding 75%. The remainder of this study will focus on GPT-4 to explore further biases.

Table 1
Overview of hierarchical bias and model performance evaluation

Bias type     Susp. bias  Cities                            GPT-4  GPT-3.5  Gemini  Llama  Correct answer  Bias ratio
Hierarchical  Yes         Portland OR to Toronto CAN          0      0        0       0    Southeast       30%
                          Tijuana MEX to San Antonio TX       3      5        0       0    Southeast
                          Wilmington NC to Philadelphia PA   10      0       10       5    Northeast
                          San Diego CA to Reno NV             2      8        0       0    Northwest
                          Memphis TN to Milwaukee WI         10      0        7       0    Northeast
              No          Santo Domingo DOM to Miami FL      10     10        0      10    Northwest       85%
                          Minneapolis MN to Chicago IL       10     10        9      10    Southeast
                          Dallas TX to San Antonio TX        10     10       10      10    Southwest
                          Havana CUB to Philadelphia PA      10      0       10       8    Northeast
                          San Antonio TX to Houston TX       10     10        9       4    Northeast
Model performance                                            75%    53%      55%     47%
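The ground-truth directions in Table 1 follow directly from city coordinates. The sketch below, with approximate hard-coded coordinates chosen only for illustration (not the scoring pipeline used in the study), reduces the latitude and longitude differences between two cities to one of the four intercardinal directions.

```python
# Illustrative derivation of ground-truth intercardinal directions from
# approximate city coordinates. A plain lat/lon comparison that ignores
# antimeridian wrap-around, which suffices for the city pairs in Table 1.
COORDS = {  # city: (latitude, longitude) in decimal degrees (approximate)
    "Portland OR":    (45.52, -122.68),
    "Toronto CAN":    (43.65,  -79.38),
    "Dallas TX":      (32.78,  -96.80),
    "San Antonio TX": (29.42,  -98.49),
}

def intercardinal(from_city: str, to_city: str) -> str:
    lat1, lon1 = COORDS[from_city]
    lat2, lon2 = COORDS[to_city]
    north_south = "North" if lat2 > lat1 else "South"
    east_west = "east" if lon2 > lon1 else "west"
    return north_south + east_west

print(intercardinal("Portland OR", "Toronto CAN"))    # Southeast
print(intercardinal("Dallas TX", "San Antonio TX"))   # Southwest
```

It is precisely this coordinate-level comparison that makes items such as Portland-Toronto or San Diego-Reno counterintuitive: the sign of the latitude or longitude difference runs against the expectation set by the enclosing state or country.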
The outcomes for biases (b) through (d) are presented in Table 2. To demonstrate proximity bias (b), GPT-4 was tasked with evaluating the relative distances between cities, employing the query: 'Which is closer to [City X]: [City A] or [City B]?' For instance, despite New Haven, Connecticut being closer to Philadelphia, Pennsylvania by both road distance (∼250 km) and great-circle measurement, the model consistently determined that Pittsburgh, Pennsylvania is the closer city (∼450 km) (0/10) (Figure 1b). However, when New Haven is replaced with Johnstown, Pennsylvania, which is ∼390 km from Philadelphia, the model consistently gives the correct answer (10/10). This possibly reflects a bias of perceiving distances within a state as shorter than distances across states.

In examining rotation bias (c), the model was asked: 'Which city is further west, [City A] or [City B]?' For instance, when asked which city is further west, Wilmington, North Carolina or Jacksonville, North Carolina, the model mistakenly (2/10) pointed to the latter, possibly reflecting a simplification of the curvature of the US east coast (Figure 1c). However, it correctly identified the relative westward position when comparing Wilmington to Morehead City, North Carolina, both coastal cities, possibly suggesting that the presence of a common geographical feature forces more precise comparisons (10/10).

Table 2
Overview of question types and model performance evaluation

Bias type  Susp. bias  Cities                                          GPT-4  Correct answer  Bias ratio
Proximity  Yes         Philadelphia PA: Pittsburgh PA or New Haven CT    0    New Haven        0%
                       Dallas TX: Houston TX or Oklahoma City OK         0    Oklahoma City
           No          Dallas TX: Houston TX or Austin TX               10    Austin           100%
                       Philadelphia PA: Johnstown PA or Pittsburgh PA   10    Johnstown
Rotation   Yes         San Diego CA or Fresno CA                         0    Fresno           10%
                       Wilmington NC or Jacksonville NC                  2    Wilmington
           No          Wilmington NC or Morehead City NC                10    Wilmington       75%
                       Los Angeles CA or San Francisco CA                5    San Francisco
Alignment  Yes         Monaco MCO to Chicago IL                          0    Southwest        0%
                       Rome ITA to Philadelphia PA                       0    Southwest
           No          Lisbon PRT to New York City NY                   10    Northwest        100%
                       Madrid ESP to Boston MA                          10    Northwest

Alignment bias (d) was examined through the query: 'What is the intercardinal direction from [City A] to [City B]?' For example, the model inaccurately determined the direction between Monaco, situated in the southern part of Europe, and Chicago, located in the northern United States (0/10). This error may reflect the common misconception that North America and Europe lie on the same east-west axis, while in reality Europe is predominantly north of the United States [7] (Figure 1d). However, it correctly ascertained the direction from Lisbon to New York City, which is straightforward because Lisbon is indeed to the south of New York City (10/10).

3. Discussion

We report our initial findings from an examination of potential systematic bias in the spatial reasoning capabilities of GPT-3.5, GPT-4, Gemini, and LLaMA-2. The models show distinct patterns in their performance: they achieve 87% accuracy on questions that do not challenge biases in spatial reasoning, indicating a strong understanding of straightforward geographical relationships, but only 24% accuracy on the questions that highlight these biases.

Our current study draws on the psychology experiments that inspired it; however, it does not offer a statistically valid assessment of how biases in spatial perception impact LLMs. To address this limitation, our future exploration will focus on the proximity bias (b), the tendency to underestimate distances within categories while overestimating distances between them. This analysis will involve querying the models with hundreds of relevant cities and examining their interrelationships.
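As a sketch of how such an experiment could be assembled (the city list, grouping, and coordinates below are illustrative assumptions rather than the final design), candidate pairs can be drawn from a coordinate table, the great-circle (haversine) distance can supply the ground truth, and the resulting 'Which is closer' prompts can then be fed to a query harness like the one sketched in the Introduction.

```python
# Sketch of generating one proximity-bias question with a great-circle ground
# truth: a same-state and an out-of-state candidate are compared against an
# anchor city. Cities and coordinates are illustrative placeholders.
from math import radians, sin, cos, asin, sqrt

CITIES = {  # city: (state code, latitude, longitude), approximate
    "Philadelphia PA": ("PA", 39.95, -75.17),
    "Pittsburgh PA":   ("PA", 40.44, -79.99),
    "New Haven CT":    ("CT", 41.31, -72.92),
}

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere with radius ~6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

anchor, same_state, other_state = "Philadelphia PA", "Pittsburgh PA", "New Haven CT"
d_same = haversine_km(*CITIES[anchor][1:], *CITIES[same_state][1:])
d_other = haversine_km(*CITIES[anchor][1:], *CITIES[other_state][1:])
truth = same_state if d_same < d_other else other_state

print(f"Which is closer to {anchor}: {same_state} or {other_state}?")
print(f"Ground truth: {truth} "
      f"({d_same:.0f} km within state vs {d_other:.0f} km across states)")
```

Scaling this up is then a matter of iterating over many anchor/candidate triples, keeping the within-group and between-group distances comparable, and scoring the models' answers against the computed ground truth.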
While our approach may allow us to verify the existence of biases in LLMs' spatial reasoning skills, we may not be able to pinpoint the source of these biases – whether they originate from learned human errors, generalized geographical input data, or the models' inherent tendencies towards conceptual associations. Nevertheless, we may be able to mitigate these issues. One potential strategy involves training LLMs with datasets that explicitly detail the spatial relationships between locations. Utilizing natural language geographic data for this purpose could enable a deliberate development of spatial reasoning skills within these models, allowing them to more accurately comprehend and process geographic relationships. In the next phase of our research, we plan to explore methods for fine-tuning an open-source LLM on such spatially explicit datasets, evaluating its ability to discern intercardinal directions and ultimately enhancing its spatial reasoning capabilities.

Figure 1: Illustration of cities demonstrating the four types of bias: (a) hierarchical, (b) proximity, (c) rotation, and (d) alignment.

Acknowledgments

N. Fulman was supported by the Health + Life Science Alliance Heidelberg Mannheim and received state funds approved by the State Parliament of Baden-Württemberg. A. Memduhoğlu was supported by the Scientific and Technological Research Council of Türkiye (TUBITAK) under the program 2219 (1059B192202917).

References

[1] Y. Zhang, C. Wei, S. Wu, Z. He, W. Yu, GeoGPT: Understanding and processing geospatial tasks through an autonomous GPT, arXiv preprint arXiv:2307.07930 (2023).
[2] J. Roberts, T. Lüddecke, S. Das, K. Han, S. Albanie, GPT4GEO: How a language model sees the world's geography, arXiv preprint arXiv:2306.00020 (2023).
[3] OpenAI, Models: GPT-4 Turbo (128k context window) and GPT-3.5 Turbo (16k context window), 2023. URL: https://platform.openai.com/docs/models, version November 21, 2023.
[4] A. Stevens, P. Coupe, Distortions in judged spatial relations, Cognitive Psychology 10 (1978) 422–437.
[5] L. P. Acredolo, L. T. Boulter, Effects of hierarchical organization on children's judgments of distance and direction, Journal of Experimental Child Psychology 37 (1984) 409–425.
[6] L. G. Braine, A new slant on orientation perception, American Psychologist 33 (1978) 10.
[7] B. Tversky, Distortions in cognitive maps, Geoforum 23 (1992) 131–138.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[9] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, ..., T. Scialom, Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[10] Gemini Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, ..., J. Ahn, Gemini: A family of highly capable multimodal models, arXiv preprint arXiv:2312.11805 (2023).
[11] N. Fulman, A. Memduhoğlu, A. Zipf, Distortions in judged spatial relations in large language models: The dawn of natural language geographic data?, arXiv preprint arXiv:2401.04218 (2024).