<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Evidence for Systematic Bias in the Spatial Memory of Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nir Fulman</string-name>
          <email>nir.fulman@uni-heidelberg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abdulkadir Memduhoğlu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Zipf</string-name>
          <email>zipf@uni-heidelberg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Glasgow, Scotland</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Geomatic Engineering, Faculty of Engineering, Harran University</institution>
          ,
          <addr-line>Sanliurfa</addr-line>
          ,
          <country country="TR">Türkiye</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>GIScience Chair, Institute of Geography, Heidelberg University</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>HeiGIT at Heidelberg University</institution>
          ,
          <addr-line>Heidelberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We report our initial findings from an examination of potential systematic bias in the spatial reasoning capabilities of Large Language Models (LLMs). We devised a series of questions to probe the spatial reasoning abilities of four LLMs (GPT-3.5, GPT-4, Gemini, and Llama-2), targeting four specific biases rooted in human spatial perception: hierarchical, proximity, rotation, and alignment biases. The questions encompassed scenarios challenging the models' spatial reasoning, and each question was posed 10 times independently to gauge the consistency of the LLMs' responses. The models demonstrated a strong understanding of straightforward geographical relationships, achieving 87% accuracy on questions that did not challenge biases in spatial reasoning. However, when faced with questions highlighting these biases, the models' accuracy dropped to 24%. We discuss the design of a large-scale experiment aimed at examining spatial cognition biases in large language models and identifying potential mitigation strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>geographic reasoning</kwd>
        <kwd>large language models</kwd>
        <kwd>cognitive maps</kwd>
        <kwd>systematic bias</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Recent studies primarily view Large Language Models (LLMs) in geography as tools linking
natural language to geographic information systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, Roberts et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] showcased
GPT-4’s [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] inherent ability to perform spatial reasoning tasks. They highlighted tasks that
extend beyond mere recall of factual information, namely GPT-4’s proficiency in calculating the
final destinations of routes based on initial locations, modes of transport, directions, and travel
durations, without reliance on external processing engines. These capabilities open up practical
applications such as creating personalized travel itineraries. Identifying the weaknesses of
LLMs in spatial tasks may assist in guiding their development in this direction.
      </p>
      <p>
        We investigate the possibility that biases in human spatial reasoning may manifest in LLMs,
focusing on four well-studied ones: (a) Hierarchical bias refers to the cognitive tendency to
infer the direction between two points based on the dominant geographical orientation of their
larger categorical groups (such as states or regions), leading to inaccuracies when exceptions
to these general orientations exist [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. (b) Proximity bias refers to the tendency to underestimate distances within the
same categorical group, perceiving them as shorter than distances between different groups,
even when the distances across groups are actually shorter [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. (c) Rotation bias refers to
the tendency to adjust the mental representation of geographical elements, aligning them
more closely with conventional cardinal directions than their actual orientations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This
simplification leads to misconceptions about the true positions of locations, as individuals
mentally ’rotate’ geographical layouts to fit a more straightforward, north-south/east-west
alignment, irrespective of their true, more complex orientations. (d) Alignment bias refers to
the cognitive inclination to overestimate the alignment of geographically grouped locations,
leading to skewed perceptions of their actual latitudinal or longitudinal relationships [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        LLMs rely on associative learning and contextual data processing to understand and generate
human-like text [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These models may manifest systematic biases in spatial reasoning due
to their training data and learning mechanisms. While human biases in spatial reasoning are
rooted in mental mapping [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which LLMs do not possess, we hypothesize that they may exhibit
similar biases, based on three considerations. First, LLMs learn from textual data, and biases
in human spatial reasoning can be present in textual descriptions of geography. Second, as
humans generalize and simplify in their cognitive maps, leading to biases, they also do so in
their textual descriptions of locations. LLMs, learning from such descriptions, could inherit
and perpetuate these biases in their spatial reasoning abilities. For example, the US is often
described as south of Canada, despite some areas within the US being located to the north.
Third, LLMs might prioritize conceptual associations, such as assuming a ’west coast’ city to be
the westernmost point without accounting for the curvature of the coastline.
      </p>
      <p>
        To investigate hierarchical bias in LLMs, we first conducted a
study with ten questions: five challenging questions, based on scenarios where humans typically
struggle and where bias is therefore likely to appear, and five control questions that serve as a
baseline for comparison. This study included four models: GPT-3.5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], GPT-4, LLaMA 2 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Gemini
1.0 Pro [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], among which GPT-4 demonstrated superior performance. Based on this outcome,
we narrowed our focus to GPT-4 for an analysis of the other types of bias. For each type, we
formulated four questions, maintaining a balance between challenging and control scenarios.
Each question in our study was posed ten times, employing a ’zero-shot’ mode to reset the model
after every question, ensuring that responses remained uninfluenced by previous interactions.
The questions were directly drawn from or inspired by well-known experiments in cognitive
psychology literature, as referenced below. This paper extends the work of Fulman et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
who provided evidence of hierarchical bias in LLMs.
      </p>
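      <p>The repeated-query protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: ask_model is a hypothetical callable standing in for whichever chat interface is queried, and count_correct, its parameters, and the crude substring check are names and choices introduced here. Each call is assumed to start a fresh conversation, so no context carries over between repetitions.</p>
      <preformat preformat-type="code">
```python
from typing import Callable

# Prompt wording taken from the hierarchical-bias questions in the text.
PROMPT = "What is the intercardinal direction from {a} to {b}?"


def count_correct(
    ask_model: Callable[[str], str],  # hypothetical: sends one prompt in a fresh session
    city_a: str,
    city_b: str,
    expected: str,
    repetitions: int = 10,
) -> int:
    """Pose the same question `repetitions` times and count correct answers."""
    prompt = PROMPT.format(a=city_a, b=city_b)
    correct = 0
    for _ in range(repetitions):
        # Each call is independent: no history from earlier calls is passed along.
        answer = ask_model(prompt)
        # Crude check (an assumption): the expected direction appears in the answer.
        if expected.lower() in answer.lower():
            correct += 1
    return correct


# Stand-in model that always answers "northeast", for demonstration only.
def always_northeast(prompt: str) -> str:
    return "The direction is northeast."


score = count_correct(always_northeast, "Portland", "Toronto", expected="southeast")
print(f"{score}/10 correct")
```
</preformat>
      <p>With the stand-in model the score is 0/10, the same notation used for the per-question results reported below. A real evaluation would replace always_northeast with a call to the model under test and would likely need a stricter answer parser.</p>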
    </sec>
    <sec id="sec-3">
      <title>2. Results</title>
      <p>
        The outcomes for hierarchical bias (a) are illustrated in Table 1. The models were instructed to
determine intercardinal directions between cities, using the prompt: ’What is the intercardinal
direction from [City A] to [City B]?’ For example, all models consistently (0/10) provided
inaccurate directions between Portland and Toronto. We attribute the error to the general
northward alignment of Canada relative to the United States (Figure 1a). This observation is
in line with the findings of Stevens and Coupe [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], who observed a similar misperception in
humans, presumably influenced by the overarching southward position of the United States
relative to Canada, leading most to incorrectly assume Toronto is north of Portland. Conversely,
when assessing the relationship between Dallas and San Antonio, both in Texas, the models
consistently provided the correct answer (10/10).
      </p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Bias type</th><th>City pair</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="10">Hierarchical</td><td>Portland OR to Toronto CAN</td></tr>
            <tr><td>Tijuana MEX to San Antonio TX</td></tr>
            <tr><td>Wilmington NC to Philadelphia PA</td></tr>
            <tr><td>San Diego CA to Reno NV</td></tr>
            <tr><td>Memphis TN to Milwaukee WI</td></tr>
            <tr><td>Santo Domingo DOM to Miami FL</td></tr>
            <tr><td>Minneapolis MN to Chicago IL</td></tr>
            <tr><td>Dallas TX to San Antonio TX</td></tr>
            <tr><td>Havana CUB to Philadelphia PA</td></tr>
            <tr><td>San Antonio TX to Houston TX</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-10">
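        <p>The ground truth for an intercardinal-direction question can be derived from coordinates. The sketch below is one simple way to do so, assuming that the direction is decided by comparing latitudes and longitudes directly; the function name intercardinal and the rounded city coordinates are assumptions introduced for illustration, not values from the study.</p>
        <preformat preformat-type="code">
```python
def intercardinal(lat_a: float, lon_a: float, lat_b: float, lon_b: float) -> str:
    """Return the intercardinal direction from point A to point B.

    Compares signed decimal-degree coordinates directly; assumes the pair does
    not straddle the antimeridian. Ties default to "south" and "west".
    """
    north_south = "north" if lat_b > lat_a else "south"
    east_west = "east" if lon_b > lon_a else "west"
    return north_south + east_west


# Approximate city-centre coordinates in decimal degrees (assumed values).
PORTLAND = (45.52, -122.68)
TORONTO = (43.65, -79.38)
DALLAS = (32.78, -96.80)
SAN_ANTONIO = (29.42, -98.49)

print(intercardinal(*PORTLAND, *TORONTO))    # southeast
print(intercardinal(*DALLAS, *SAN_ANTONIO))  # southwest
```
</preformat>
        <p>Toronto lies at a lower latitude than Portland, so the expected answer is southeast even though Canada as a whole lies north of the United States.</p>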
        <p>GPT-4 demonstrates the highest accuracy in this assessment, achieving a 75% success rate,
followed by Gemini with 55%, GPT-3.5 with 53%, and LLaMA-2 at 47%. In scenarios designed
to highlight hierarchical bias, GPT-4 distinctly outperforms its counterparts, registering a 50%
accuracy rate. In comparison, Gemini scores 34%, GPT-3.5 26%, and LLaMA-2 only 10%.
However, on questions without suspected hierarchical bias, all models perform better,
with accuracy rates exceeding 75%. The remainder of this study focuses on
GPT-4 to explore further biases.</p>
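        <p>The reported figures can be checked for consistency with a short calculation. The sketch below assumes that challenging and control questions contribute equally many trials (five questions each, ten repetitions per question, as described in the Introduction), so that overall accuracy is the mean of the two; the dictionary names are introduced here for illustration.</p>
        <preformat preformat-type="code">
```python
# Accuracy figures reported in the text, in percent.
overall = {"GPT-4": 75, "Gemini": 55, "GPT-3.5": 53, "LLaMA-2": 47}
challenging = {"GPT-4": 50, "Gemini": 34, "GPT-3.5": 26, "LLaMA-2": 10}

# Assumption: both question groups have the same number of trials, so
# overall = (challenging + control) / 2, hence control = 2 * overall - challenging.
control = {model: 2 * overall[model] - challenging[model] for model in overall}

for model, value in control.items():
    print(f"{model}: {value}% on control questions")
```
</preformat>
        <p>Under this assumption the implied control accuracies are 100% for GPT-4, 76% for Gemini, 80% for GPT-3.5, and 84% for LLaMA-2, in line with the statement that all models exceed 75% on these questions.</p>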
        <p>The outcomes for biases (b) through (d) are presented in Table 2. To demonstrate proximity
bias (b), GPT-4 was tasked with evaluating the relative distances between cities, employing the
query: ’Which is closer to [City X]: [City A] or [City B]?’ For instance, despite New Haven,
Connecticut being closer to Philadelphia, Pennsylvania by both road distance (∼250km) and
great circle measurements, the model consistently determined that Pittsburgh, Pennsylvania
is the closer city (∼450km) (0/10) (Figure 1b). However, when New Haven is replaced with
Johnstown, Pennsylvania, which is ∼390km from Philadelphia, the model consistently gives
the correct answer (10/10). This may reflect a bias toward perceiving distances within a state
as shorter than distances across states.</p>
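        <p>The great-circle comparison behind this question can be reproduced with the haversine formula. The following is a minimal sketch; the function name great_circle_km, the mean Earth radius of 6371 km, and the rounded city coordinates are assumptions made for illustration rather than values taken from the study.</p>
        <preformat preformat-type="code">
```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate city-centre coordinates in decimal degrees (assumed values).
PHILADELPHIA = (39.95, -75.17)
NEW_HAVEN = (41.31, -72.93)
PITTSBURGH = (40.44, -80.00)

to_new_haven = great_circle_km(*PHILADELPHIA, *NEW_HAVEN)
to_pittsburgh = great_circle_km(*PHILADELPHIA, *PITTSBURGH)
closer = "New Haven" if to_pittsburgh > to_new_haven else "Pittsburgh"
print(f"New Haven: {to_new_haven:.0f} km, Pittsburgh: {to_pittsburgh:.0f} km, closer: {closer}")
```
</preformat>
        <p>With these coordinates New Haven comes out as the closer city, which is the answer the model failed to give. Great-circle distances are never longer than road distances, so the printed values need not match the road figures quoted above.</p>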
        <p>In examining the rotation bias (c), the model was asked: ’Which city is further west, [City
A] or [City B]?’ For instance, when asked which city is further west, Wilmington,
North Carolina or Jacksonville, North Carolina, the model mostly pointed, mistakenly, to the
latter (2/10 correct), possibly reflecting a simplification of the US east coast curvature (Figure 1c).
However, it correctly identified the relative westward position when comparing Wilmington to Morehead
City, North Carolina (10/10), both being coastal cities, possibly suggesting that the presence of a
common geographical feature forces more precise comparisons.</p>
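        <p>The expected answer to a ’further west’ question follows from a longitude comparison. The sketch below illustrates this; the helper name further_west and the rounded longitudes are assumptions introduced here, not values from the study.</p>
        <preformat preformat-type="code">
```python
def further_west(city_a: tuple[str, float], city_b: tuple[str, float]) -> str:
    """Return the name of the city with the smaller (more westerly) longitude.

    Each city is a (name, longitude) pair in signed decimal degrees; assumes
    the pair does not straddle the antimeridian.
    """
    return min(city_a, city_b, key=lambda city: city[1])[0]


# Approximate longitudes in decimal degrees (assumed values).
WILMINGTON = ("Wilmington", -77.94)
JACKSONVILLE = ("Jacksonville", -77.43)
MOREHEAD_CITY = ("Morehead City", -76.73)

print(further_west(WILMINGTON, JACKSONVILLE))   # Wilmington
print(further_west(WILMINGTON, MOREHEAD_CITY))  # Wilmington
```
</preformat>
        <p>In both comparisons Wilmington is the more westerly city, although it lies further south along the coast.</p>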
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Discussion</title>
      <p>We report our initial findings from an examination of potential systematic bias in the spatial
reasoning capabilities of GPT-3.5, GPT-4, Gemini, and Llama-2. The models show distinct
patterns in their performance: they achieve 87% accuracy on questions that do not challenge
biases in spatial reasoning, indicating a strong understanding of straightforward geographical
relationships, but only 24% accuracy on the questions that
highlight these biases.</p>
      <p>Our current study draws from the psychology experiments that served as our inspiration;
however, it does not offer a statistically valid assessment of how biases in spatial perception
impact LLMs. To address this limitation, our future exploration will focus on the proximity
bias, denoted as (b), which relates to the tendency to underestimate distances within categories
while overestimating distances between them. This analysis will involve querying the models
with hundreds of relevant cities and examining their interrelationships.</p>
      <p>While our approach may allow us to verify the existence of biases in LLMs’ spatial reasoning
skills, we may not be able to pinpoint the source of these biases – whether they originate from
learned human errors, generalized geographical input data, or the models’ inherent tendencies
towards conceptual associations. Nevertheless, we may be able to mitigate these issues. One
potential strategy involves training LLMs with datasets explicitly detailing spatial relationships
between various locations. Utilizing Natural Language Geographic Data for this purpose could
usher in a deliberate development of spatial reasoning skills within these models, enabling them
to more accurately comprehend and process geographic relationships. In the next phase of our
research, we plan to explore methods for fine-tuning an open-source LLM using such spatially
explicit datasets, evaluating its ability to discern intercardinal directions and ultimately enhance
its spatial reasoning capabilities.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>N. Fulman was supported by the Health + Life Science Alliance Heidelberg Mannheim and
received state funds approved by the State Parliament of Baden-Württemberg.
A. Memduhoğlu was supported by the Scientific and Technological Research Council of Türkiye
(TUBITAK) under the program 2219 (1059B192202917).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Geogpt: Understanding and processing geospatial tasks through an autonomous gpt</article-title>
          ,
          <source>arXiv preprint arXiv:2307.07930</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lüddecke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
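          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albanie</surname>
          </string-name>
          ,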
          <article-title>Gpt4geo: How a language model sees the world's geography</article-title>
          ,
          <source>arXiv preprint arXiv:2306.00020</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] OpenAI,
          <article-title>Models: Gpt-4 turbo (128k context window) and gpt-3.5 turbo (16k context window)</article-title>
          ,
          <year>2023</year>
          . URL: https://platform.openai.com/docs/models, version November 21,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Coupe</surname>
          </string-name>
          ,
          <article-title>Distortions in judged spatial relations</article-title>
          ,
          <source>Cognitive Psychology</source>
          <volume>10</volume>
          (
          <year>1978</year>
          )
          <fpage>422</fpage>
          -
          <lpage>437</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Acredolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T.</given-names>
            <surname>Boulter</surname>
          </string-name>
          ,
          <article-title>Effects of hierarchical organization on children's judgments of distance and direction</article-title>
          ,
          <source>Journal of Experimental Child Psychology</source>
          <volume>37</volume>
          (
          <year>1984</year>
          )
          <fpage>409</fpage>
          -
          <lpage>425</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Braine</surname>
          </string-name>
          ,
          <article-title>A new slant on orientation perception</article-title>
          ,
          <source>American Psychologist</source>
          <volume>33</volume>
          (
          <year>1978</year>
          )
          <fpage>10</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Tversky</surname>
          </string-name>
          ,
          <article-title>Distortions in cognitive maps</article-title>
          ,
          <source>Geoforum</source>
          <volume>23</volume>
          (
          <year>1992</year>
          )
          <fpage>131</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , et al.,
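          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>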
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. ...</given-names>
            <surname>Scialom</surname>
          </string-name>
          ,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. ...</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Memduhoğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zipf</surname>
          </string-name>
          ,
          <article-title>Distortions in judged spatial relations in large language models: The dawn of natural language geographic data?</article-title>
          ,
          <source>arXiv preprint arXiv:2401.04218</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>