<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Geolocation in LLMs: Experiments on Probing LLMs for Geographic Knowledge and Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mila Stillman</string-name>
          <email>mila.stillman@hm.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kruspe</string-name>
          <email>anna.kruspe@hm.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hochschule München</institution>,
          <addr-line>Lothstraße 64, München</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2041</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Geographic biases in Large Language Models (LLMs) are evident. Research has shown that both the training data and outputs of LLMs are skewed towards western and economically affluent countries, resulting in the underrepresentation of certain regions. In addition, LLMs are prone to hallucinations, which can lead to the generation of incorrect or fabricated information. In this paper, we present an additional analysis of the geographic knowledge and geospatial reasoning of LLMs through two experiments carried out in English on a global scale. Specifically, we evaluate the geospatial capabilities of four open-source LLMs, namely Llama 2-7B, Llama 3 (8B and 70B) and Phi-3-mini-4k, and demonstrate that geographic knowledge within LLMs is unevenly distributed across different regions of the world. This imbalance could lead to unfair treatment of certain areas and impact various applications that use geographic knowledge, including mobility and remote sensing applications that aim to use LLMs for data analysis and decision-making.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Bias</kwd>
        <kwd>Geospatial data</kwd>
        <kwd>Reasoning</kwd>
        <kwd>Fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>geography and advanced planning abilities using a fixed number of stops in different countries. Here,
we test the ability of LLMs to combine those reasoning skills and analyze the effect of different starting
points on performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Current LLMs use Transformers [20] as their backbone architecture. They are black boxes and are not
interpretable without additional components [21]. These models are trained on large textual datasets,
or training corpora. These often contain inherent selection biases, leading to models that misrepresent
both the underlying data and the real-world phenomena they aim to capture [13]. For instance,
crowdsourced geographic data on platforms such as OpenStreetMap [22] and Wikipedia [23, 24] exhibit systemic
biases and an uneven distribution of geographic information. In addition, geographic information from
geotagged social media posts is skewed mainly towards urban and wealthy regions [
        <xref ref-type="bibr" rid="ref4">25, 26, 4</xref>
        ].
      </p>
      <p>Geographic knowledge in LLMs is an increasingly studied field, especially since spatial and temporal
representations have been demonstrated in LLMs [9]. Furthermore, 18.7% of the Common Crawl
corpus used to train LLMs has been estimated to contain geospatial information such as addresses
and geocoordinates [27]. Research reveals the potential to use LLMs in applications, such as extracting
geocoordinates of cities and locations [28, 11], trip itineraries [11], and the use of GIS agents to automate
geospatial tasks [29]. However, investigations into the geographic knowledge of these models show
limitations, distance distortions, and inequality between regions [30, 14, 31, 28, 11, 32]. [33] researched
the geographic knowledge of ChatGPT by measuring its results while taking a Geographic Information
Systems (GIS) exam. The research shows that the GPT-3.5 and GPT-4 models achieve test scores of 66%
and 88.3%, respectively. Studies testing GPT-4’s capabilities in route planning and geocoded information
retrieval reveal a certain level of success on geospatial tasks. However, there are some limitations in
abstract reasoning, suggesting that memorization could play a role in task performance [32, 11]. [34]
shows that LLMs demonstrate confidence in simple route planning tasks using cognitive maps; however,
the authors suggest that this confidence is likely attributed to memorized routes rather than a genuine
understanding of cognitive maps, route planning strategies or inference capabilities. Furthermore, the
authors indicate that LLMs tend to fail due to hallucinations, constructing overly long routes in some cases,
or getting stuck in loops. The authors in [35] tested the factuality of LLMs on a global scale using data
from the World Bank and found significant biases for countries with lower income levels and
in certain regions. This paper is a continuation of the work on identifying and quantifying existing
geographic biases in LLMs. Future mobility applications, such as route planning and navigation, as well
as the customization of LLMs to different geographic locations, require precise geographic knowledge.
Moreover, any potential bias or discriminatory treatment of certain locations by LLMs could pose harm
to individuals and therefore requires a careful examination.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our experimental setup includes probing and comparing the output of four LLMs for extraction of
georeferenced information. In the first experiment, instead of asking LLMs to indicate the geocoordinates
of known locations, e.g., cities or points of interest, as has been done in previous research, we conduct
a reverse geocoding experiment by selecting geocoordinates in a randomized manner and ask the
LLMs to ’guess’ their location. This task is not common in texts; however, intelligent models that
have an embedded geographic component should be able to make these educated guesses, similarly to
human geographers. The models are downloaded from Hugging Face [36], namely Llama 2 [37] with
7B parameters, Llama 3 [38] with 8B and 70B parameters, and Phi-3-mini [39] with a 4k-token context window. The
use of the four models is subject to their respective license agreements. The probing experiments are
conducted using the transformers library from Hugging Face [40]. We choose open-source models
due to the ability to run these locally without additional API costs. All experiments are run using two
Nvidia RTX A6000 GPUs. The exact models used are as follows: Llama-2-7b-chat-hf,
Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.3-70B-Instruct, and Phi-3-mini-4k-instruct. For Llama 2, the chat model was
selected due to its additional human-feedback training, which improved its performance on benchmark
datasets compared to the original model. For the other LLMs, the instruction-tuned models are used due to
their improved performance on common industry benchmarks compared to base or chat models. The
prompts are adjusted to fit the geographic context via a system prompt. Programming is excluded in
this task, since models tend to use code to download and process geospatial data. Instead, to extract
potential inherent biases, we probe the implicit geographic knowledge that LLMs already possess. The
answer ’not available’ is acceptable when the LLMs are unable to provide an answer. The prompt is
adjusted to fit the template provided by each model, while the prompt text remains the same. The
number of tokens is limited to 1000. The temperature values are the default values: for Llama 2 and
Llama 3 models the temperature value is 0.6. For Phi-3 the temperature was set to 0.0 as suggested
for inference in the model’s card on Hugging Face. The Llama 3-70B model was quantized to 8-bit for
memory optimization using bitsandbytes from transformers [40]. The text of the system prompt is as
follows:
Your role is an expert geographer who is familiar with the geocoordinate system and world geography. Provide short and concise answers to the question you are asked. Programming is not allowed. If you do not know the answer, answer with ’not available’.</p>
      <p>In both experiments, we used the 177 countries from the naturalearth_lowres dataset using
the Geopandas [41] library in Python. In the first experiment, for each country, we generate
random uniformly distributed geocoordinates within its bounding box. We use polygon data from the
'naturalearth_lowres' dataset to keep only points that fall on land within the countries’ polygons,
removing any in the oceans, until reaching exactly 20 points per country. For each pair of generated
geocoordinates, we ask the LLMs to indicate the country to which they belong. The text of the user
prompt is provided below:
You will be given a set of geocoordinates in the form of (lat, long).</p>
      <p>Provide the country to which these geocoordinates belong. First, identify the country name, then provide only the country code in ISO 3166 format.</p>
      <p>Here are the geocoordinates:</p>
      <p>In both experiments, the LLMs are asked to identify the country name and provide the country code
in ISO-3166 format. This approach to prompting allowed us to improve robustness and avoid mismatches
in country names caused by abbreviations and other toponyms.</p>
      <p>In the second experiment, we examine the ability of LLMs to plan routes on a global scale. We
prompt the LLMs to plan a trip itinerary around the world. In other words, define a route that will
circumnavigate the Earth and return to the starting point, by using any means of travel. Curating such
a trip requires reasoning capabilities, understanding geography on a global scale, applying a circular
shape to the route, while comprehending that the Earth is spherical. To assess potential biases in this
task, we ran this experiment using a different starting point each time, that is, starting from each of the
177 countries in the dataset. The number of stops was limited to 12 countries. The LLMs are also asked
to provide the geocoordinates at each stop. To prevent ambiguities associated with the phrase “around
the world”, which could imply visiting various locations globally without necessarily completing a full
circle around the Earth, the experiment is repeated with revised wording, changing the phrase “a trip
around the world” to “a trip circumnavigating the Earth”. An example of the desired geocoordinate
format was provided to encourage its adoption, due to the models’ tendency to provide various formats
of geocoordinates. Given the example, this tendency was reduced, but not fully eliminated. In this
experiment, the user prompt is as follows:</p>
      <p>In both experiments, the prompt is designed to make geospatial predictions in a zero-shot manner.
Multiple prompt variations were explored to achieve concise results that convey the necessary
information. Further improvements in the form of prompt engineering are left for future work. The code is
available at github.com/Milast/geo_biases_llms.</p>
      <p>During post-processing, the pycountry library is used to identify country codes in ISO-3166 format
within the free text given by the LLMs. The rate of missing or incorrect values in the country output
of the LLMs in the first experiment is less than 2% for all models. We consider this error
rate acceptable when working with generated text. This is partially caused by the option given to the
models to answer with ’not available’, as well as due to the LLMs providing false or no country codes
in rare cases. The country Kosovo was manually added since it does not have an ISO-3166 code. The
geocoordinates generated in the second experiment in some cases did not match the coordinates of the
provided country or were beyond the limits of longitude and latitude values. Further inspection and
analysis of false geocoding is left for future work.</p>
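      <p>The code-extraction step can be sketched as below; the toy code table stands in for the pycountry lookup actually used, with Kosovo’s provisional code added by hand to mirror the manual fix described above:</p>
      <preformat>
```python
import re

# Minimal stand-in for the pycountry database: a few ISO 3166-1 alpha-2 codes.
# "XK" (Kosovo) is added manually, as it has no official ISO-3166 entry.
ISO_CODES = {"DE": "Germany", "IT": "Italy", "ZA": "South Africa", "XK": "Kosovo"}

def extract_country_code(llm_text):
    """Scan free-form LLM output for a known two-letter country code; return
    None when no valid code is found (counted as 'missing' in the evaluation)."""
    for token in re.findall(r"\b[A-Z]{2}\b", llm_text):
        if token in ISO_CODES:
            return token
    return None
```
      </preformat>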
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. First experiment</title>
        <p>First, we analyze whether the LLMs are able to correctly identify the countries. When asking LLMs for
both country name and country code, all models had a high preference towards providing an answer
rather than answering with the option ’not available’. The percentage of missing answers due to the
optional ’not available’ answer or incorrect country codes is provided in Table 1. The Llama 2 model
exhibited the most biased behavior, as well as proved to be the least knowledgeable. The continents
with the best performance are Europe and North America, followed by Oceania and South America.
Asia had a relatively low score, and all models performed the worst in recognizing countries in Africa.
This is in line with previous research that studied distorted distances and geographic biases in this
region. Both Antarctica and Seven Seas (open ocean) continents in the dataset have a single country
each and demonstrated low accuracies. This is expected since those are remote areas that most likely
do not appear often in the training data of LLMs. Surprisingly, the larger model of Llama 3 with 70B
parameters performed worse in this task than the Llama 3 with 8B parameters and the much smaller
Phi-3-mini-4k model. As a post-processing step, we analyze whether the country with which the LLMs
respond belongs to the same continent as the correct country. Here, we notice a better performance
for all continents. The Phi-3-mini-4k and Llama 3-8B models demonstrate high accuracies in Africa
compared to the other two models. This suggests that LLMs do possess a certain understanding of
geolocation on a continent level. The results for the three models are presented in Figure 1, Figure 2,
and Table 1.</p>
        <p>The Llama 2 model has the lowest diversity of countries given as output and the lowest number
of countries which were correctly identified at least once. Interestingly, the Llama 3 model with 70B
parameters has a lower diversity of countries than both the Llama 3-8B and the Phi-3 model. For
Llama 2, countries such as Indonesia, South Africa, Turkey and Germany are frequently used. In Llama
3-8B, the most frequently used countries were South Africa and Mozambique, and the Llama 3-70B
model frequently used China and Italy. Surprisingly, the Phi-3-mini-4k model most frequently used
South Africa, Turkey, Ghana and Kenya.</p>
        <p>(Figure 1: number of correctly identified countries per model; Figure 2: number of correctly geocoded countries per model, for Llama 2-7B, Llama 3-8B, Llama 3-70B and Phi-3-mini-4k.)</p>
        <p>The top 20 countries and their frequency of occurrence for each model are presented in Table 3
in Appendix A. We also analyze the relationship between the frequency of countries provided by the
LLMs and their Gross Domestic Product (GDP) and population estimate. The Llama 3-8B and the
Phi-3 models show a weak correlation with population estimate, while both Llama 3-70B and Llama
2-7B show a moderate correlation. The correlation with GDP is most prominent in the Llama 3-70B
model. We calculate the Pearson correlation coefficient and find significant p-values for both GDP and
population estimate in all models except Phi-3, with p-values of 0.047 and 0.075 for GDP and population
estimate, respectively. The calculated Pearson correlation coefficients and the corresponding p-values
are summarized in Table 2. Further analysis of correlation with travel data or other training data was
left for future work.</p>
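        <p>The correlation check described above uses the standard Pearson coefficient, which can be computed as in this minimal sketch (the function is a plain-Python stand-in for a library routine; the data in the test is illustrative, not the paper’s):</p>
        <preformat>
```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. country mention counts vs. GDP values (toy numbers)
r = pearson_r([3, 10, 25, 40], [1.2, 2.9, 7.5, 14.0])
```
        </preformat>
        <p>In practice the significance test (the reported p-values) would come from a statistics package such as scipy.stats.pearsonr rather than being computed by hand.</p>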
        <p>Figure 3 indicates the GDP and population estimate as a function of how often they were mentioned
by the LLMs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Second experiment</title>
        <p>In the second experiment, the LLMs generate trip itineraries for a trip around the world. We calculate
and analyze the average total distance traveled using the Haversine Distance formula [42] and find that
it is much longer in the Llama 3-70B and Phi-3 models, with an average distance traveled resembling
the Earth’s circumference of around 40,000 km or higher. In comparison, the Llama 2 and Llama 3-8B
models’ average distances were close to 30,000 km or shorter. Moreover, the choice of wording, or
lexical choice, has an effect on the average travel distance. For example, in the Phi-3 model the distance
is shorter with the word ’circumnavigate’, while for Llama 2 and Llama 3-8B it is longer. For Llama 2
and Phi 3, the average distance traveled from African countries is the shortest, followed by Asia and
Europe. The distance traveled by Llama 3-70B is robust to changes in language and to starting countries
on different continents.</p>
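        <p>The distance computation relies on the Haversine formula [42]; a straightforward implementation, together with an illustrative helper that sums an itinerary’s legs (the helper name is ours, not from the released code), looks as follows:</p>
        <preformat>
```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def itinerary_length_km(stops):
    """Total distance of an itinerary given as a list of (lat, lon) stops."""
    return sum(haversine_km(*stops[i], *stops[i + 1]) for i in range(len(stops) - 1))
```
        </preformat>
        <p>An itinerary whose total length approaches the Earth’s circumference (about 40,000 km) is consistent with a genuine circumnavigation, whereas much shorter totals suggest a regional round trip.</p>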
        <p>Furthermore, the percentage of countries chosen for the trip around the world that belong to the
same continent as that of the starting country is found to be significantly higher in Africa for Phi-3 and
Llama 2, irrespective of the syntactic variation and without repeating countries. In terms of the shape of
the trips, the qualitative results suggest that Llama 3-70B can best capture a rounded trip. Examples of
trip shapes can be found in Appendix B. The average distance traveled and the percentage of stops
within the same continent as the starting point are presented in Figure 4 and Figure 5, respectively.</p>
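        <p>The same-continent percentage can be computed with a simple helper like the one below; the toy continent lookup is illustrative only, whereas the paper derives continent membership from the naturalearth_lowres attributes:</p>
        <preformat>
```python
def same_continent_share(stops, continent_of, start_country):
    """Share of itinerary stops lying on the starting country's continent."""
    home = continent_of[start_country]
    on_home = sum(1 for c in stops if continent_of.get(c) == home)
    return on_home / len(stops)

# Toy lookup standing in for the naturalearth_lowres 'continent' column
continent_of = {"Kenya": "Africa", "Ghana": "Africa", "Italy": "Europe", "China": "Asia"}
share = same_continent_share(["Ghana", "Italy", "China", "Kenya"], continent_of, "Kenya")
```
        </preformat>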
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Some of the inaccuracies generated by LLMs are expected, due to the training data of LLMs, which
includes geographic data (e.g., from Wikipedia, social media, etc.) that are potentially skewed towards
western and more affluent parts of society. Although LLMs might not possess, or be able to interpret,
polygon information of countries to perform reverse geocoding accurately, the models did attempt to
guess the country in most cases, even when given an option to answer with ’not available’. This effect
could partially be due to the choice of models, prompts, and model parameters, which could be further
optimized in the future. Surprisingly, the Llama 2 model has a small number of selected countries,
which could be the result of the difference in the size of the training data compared to the other models.
The Phi-3-mini-4k, a much smaller model, has competitive performance in terms of accuracy and bias.
Surprisingly, the least biased model was Llama 3-8B, which even performed better than the Llama 3-70B
model in the first experiment. This indicates that perhaps a larger model is not necessarily superior to
smaller models in the geographic information extraction task. However, in the second experiment the
larger model proved to be more robust to lexical choice and provided more circular routes than the
other models, suggesting improved geospatial reasoning capabilities. The Phi-3 model generated longer
trips too; however, it was less robust to formulation and formed trip shapes that are more chaotic. Some
potential explanations for less accurate routes could include travel restrictions to certain countries
and not enough training data from people traveling around the world from these regions. Another
interesting result is that while toponym resolution remains a challenge in GeoAI, specifically when
probing LLMs for geographic knowledge, we found that unifying country names using country codes
worked well. Nevertheless, the geocoordinates provided by the LLMs had different formats and required
manual work and resolution. In some cases, they were beyond the geographic limits or misrepresented
the corresponding country. The optimization, verification and standardization of the output of LLMs
in the geographic context could be a future research direction.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Large Language Models exhibit biases, including geographic biases. In this paper, we demonstrate that
the geographic knowledge of LLMs is partially inaccurate and biased against less economically strong
regions of the world. We conducted experiments with four LLMs to assess their ability to perform
two geospatial tasks. We find that the representation of objective geographic knowledge is unequal
between regions and that some countries are overrepresented. Future work includes testing more LLMs,
using multilingual prompts and chain-of-thought prompts with provided polygon information, a larger
amount of generated geocoordinates, and prompts using different geographic granularity. In addition,
spatial fairness is an important aspect in avoiding common pitfalls [43]. Since random geocoordinates
are used in the first experiment, it is crucial to conduct a thorough analysis of these locations. For
instance, in densely populated areas, random geocoordinates may provide more information compared
to those in rural or sparsely populated regions. Furthermore, conducting correlation analyses with data
sources such as worldwide travel data and other sources of LLMs’ training data could be beneficial.
Finally, bias in LLMs affects a variety of emerging socially critical applications, e.g., in human resources,
journalism, and education [44]. Furthermore, integrating geolocation smoothly with LLMs for remote
sensing and mobility applications would require high accuracy and trustworthiness. Uncovering such
biases and knowledge gaps is a critical first step towards improving the explainability of these models
and for the development of future solutions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Improve writing style of
some of the sentences. Further, the author(s) used ChatGPT to generate code for data analysis functions,
which were used as a template and modified to fit the relevant data and tasks. After using these
tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[9] W. Gurnee, M. Tegmark, Language models represent space and time, arXiv preprint
arXiv:2310.02207 (2023).
[10] R. Manvi, S. Khanna, G. Mai, M. Burke, D. Lobell, S. Ermon, Geollm: Extracting geospatial
knowledge from large language models, arXiv preprint arXiv:2310.06213 (2023).
[11] J. Roberts, T. Lüddecke, S. Das, K. Han, S. Albanie, Gpt4geo: How a language model sees the
world’s geography, arXiv preprint arXiv:2306.00020 (2023).
[12] K. Salmas, D.-A. Pantazi, M. Koubarakis, Extracting Geographic Knowledge from Large Language
Models: An Experiment, in: KBC-LM’23: Knowledge Base Construction from Pre-trained Language
Models workshop at ISWC 2023, 2023.
[13] R. Navigli, S. Conia, B. Ross, Biases in large language models: origins, inventory, and discussion,</p>
      <p>ACM Journal of Data and Information Quality 15 (2023) 1–21.
[14] J. Dunn, B. Adams, H. T. Madabushi, Pre-trained language models represent some geographic
populations better than others, arXiv preprint arXiv:2403.11025 (2024).
[15] A. Kruspe, Towards detecting unanticipated bias in large language models, arXiv preprint
arXiv:2404.02650 (2024).
[16] R. Manvi, S. Khanna, M. Burke, D. Lobell, S. Ermon, Large language models are geographically
biased, arXiv preprint arXiv:2402.02680 (2024).
[17] A. Kruspe, M. Stillman, Saxony-Anhalt is the worst: Bias towards german federal states in
large language models, in: German Conference on Artificial Intelligence (Künstliche Intelligenz),
Springer, 2024, pp. 160–174.
[18] S. Mirza, B. Coelho, Y. Cui, C. Pöpper, D. McCoy, Global-Liar: Factuality of LLMs over time and
geographic regions, arXiv preprint arXiv:2401.17839 (2024).
[19] F. Faisal, A. Anastasopoulos, Geographic and geopolitical biases of language models, arXiv
preprint arXiv:2212.10408 (2022).
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017).
[21] E. Cambria, L. Malandri, F. Mercorio, N. Nobani, A. Seveso, Xai meets llms: A survey of the relation
between explainable ai and large language models, arXiv preprint arXiv:2407.15248 (2024).
[22] J. Thebault-Spieker, B. Hecht, L. Terveen, Geographic Biases are ’Born, not Made’ Exploring
Contributors’ Spatiotemporal Behavior in OpenStreetMap, in: Proceedings of the 2018 ACM
International Conference on Supporting Group Work, 2018, pp. 71–82.
[23] C. Hube, Bias in wikipedia, in: Proceedings of the 26th International Conference on World Wide</p>
      <p>Web Companion, 2017, pp. 717–721.
[24] M. Graham, B. Hogan, R. K. Straumann, A. Medhat, Uneven geographies of user-generated
information: Patterns of increasing informational poverty, Annals of the Association of American
Geographers 104 (2014) 746–764.
[25] B. Hecht, M. Stephens, A tale of cities: Urban biases in volunteered geographic information, in:
Proceedings of the international AAAI conference on Web and Social Media, volume 8, 2014, pp.
197–205.
[26] L. Li, M. F. Goodchild, B. Xu, Spatial, temporal, and socioeconomic patterns in the use of Twitter
and Flickr, Cartography and geographic information science 40 (2013) 61–77.
[27] I. Ilyankou, M. Wang, S. Cavazzi, J. Haworth, Quantifying geospatial in the common crawl
corpus, in: Proceedings of the 32nd ACM International Conference on Advances in Geographic
Information Systems, 2024, pp. 585–588.
[28] P. Bhandari, A. Anastasopoulos, D. Pfoser, Are large language models geospatially knowledgeable?,
in: Proceedings of the 31st ACM International Conference on Advances in Geographic Information
Systems, 2023, pp. 1–4.
[29] Z. Li, H. Ning, Autonomous GIS: the next-generation AI-powered GIS, Int. J. Digit. Earth 16 (2023)
4668–4686.
[30] R. Decoupes, R. Interdonato, M. Roche, M. Teisseire, S. Valentin, Evaluation of Geographical
Distortions in Language Models: A Crucial Step Towards Equitable Representations, arXiv preprint
arXiv:2404.17401 (2024).
[31] P. Schwöbel, J. Golebiowski, M. Donini, C. Archambeau, D. Pruthi, Geographical erasure in
language generation, arXiv preprint arXiv:2310.14777 (2023).
[32] S. Das, Evaluating the Capabilities of Large Language Models for Spatial and Situational
Understanding, Ph.D. thesis, Thesis (MA). University of Cambridge, 2023.
[33] P. Mooney, W. Cui, B. Guan, L. Juhász, Towards understanding the geospatial skills of chatgpt:
Taking a geographic information systems (gis) exam, in: Proceedings of the 6th ACM SIGSPATIAL
International Workshop on AI for Geographic Knowledge Discovery, 2023, pp. 85–94.
[34] I. Momennejad, H. Hasanbeig, F. Vieira Frujeri, H. Sharma, N. Jojic, H. Palangi, R. Ness, J. Larson,
Evaluating cognitive maps and planning in large language models with cogeval, in: A. Oh, T.
Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information
Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 69736–69751. URL: https://proceedings.
neurips.cc/paper_files/paper/2023/file/dc9d5dcf3e86b83e137bad367227c8ca-Paper-Conference.pdf.
[35] M. Moayeri, E. Tabassi, S. Feizi, Worldbench: Quantifying geographic disparities in llm factual
recall, in: The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024, pp.
1211–1228.
[36] S. M. Jain, Hugging face, in: Introduction to transformers for NLP: With the hugging face library
and models to solve problems, Apress Berkeley, CA, 2022, pp. 51–67.
[37] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
arXiv:2307.09288 (2023).
[38] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang,</p>
      <p>A. Fan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).
[39] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
A. Bakhtiari, H. Behl, et al., Phi-3 technical report: A highly capable language model locally on
your phone, arXiv preprint arXiv:2404.14219 (2024).
[40] T. Wolf, Transformers: State-of-the-art natural language processing, arXiv preprint
arXiv:1910.03771 (2020).
[41] K. Jordahl, J. V. den Bossche, M. Fleischmann, J. Wasserman, J. McBride, J. Gerard, J. Tratner,
M. Perry, A. G. Badaracco, C. Farmer, G. A. Hjelle, A. D. Snow, M. Cochran, S. Gillies, L. Culbertson,
M. Bartos, N. Eubank, maxalbert, A. Bilogur, S. Rey, C. Ren, D. Arribas-Bel, L. Wasser, L. J. Wolf,
M. Journois, J. Wilson, A. Greenhall, C. Holdgraf, Filipe, F. Leblanc, geopandas/geopandas: v0.8.1,
2020. URL: https://doi.org/10.5281/zenodo.3946761. doi:10.5281/zenodo.3946761.
[42] C. C. Robusto, The cosine-haversine formula, The American Mathematical Monthly 64 (1957)
38–40.
[43] S. Shaham, G. Ghinita, C. Shahabi, Models and mechanisms for spatial data fairness, in: Proceedings
of the VLDB Endowment. International Conference on Very Large Data Bases, volume 16, NIH
Public Access, 2022, p. 167.
[44] C. Filippo, G. Vito, S. Irene, B. Simone, F. Gualtiero, Future applications of generative large
language models: A data-driven case study on chatgpt, Technovation 133 (2024) 103002.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Frequency of use of countries by the LLMs</title>
      <p>B. Round-the-World Trip examples</p>
      <p>(Trip-shape panels: (c) Llama 3-8B, “trip around the world”; (e) Llama 3-70B, “trip around the world”; (g) Phi-3-mini-4k, “trip around the world”.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lobry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <article-title>RSVQA: Visual question answering for remote sensing data</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>58</volume>
          (
          <year>2020</year>
          )
          <fpage>8555</fpage>
          -
          <lpage>8566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>CityGPT: Empowering urban spatial cognition of large language models</article-title>
          (
          <year>2024</year>
          ). URL: http://arxiv.org/abs/2406.13948. doi:10.48550/arXiv.2406.13948, arXiv:2406.13948 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mostafavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gharaibeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abedin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <article-title>VictimFinder: Harvesting rescue requests in disaster response from social media with BERT</article-title>
          ,
          <source>Computers, Environment and Urban Systems</source>
          <volume>95</volume>
          (
          <year>2022</year>
          )
          <fpage>101824</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kochupillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Werner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Taubenböck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abdulahhad</surname>
          </string-name>
          ,
          <article-title>Geoinformation harvesting from social media data: A community remote sensing approach</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Magazine</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>150</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McKenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhaduri</surname>
          </string-name>
          ,
          <article-title>GeoAI: Spatially explicit artificial intelligence techniques for geographic knowledge discovery and beyond</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <article-title>Towards a foundation model for geospatial artificial intelligence (vision paper)</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Advances in Geographic Information Systems</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Handbook of geospatial artificial intelligence</source>
          , CRC Press, Boca Raton,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Louwerse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Zwaan</surname>
          </string-name>
          ,
          <article-title>Language encodes geographical information</article-title>
          ,
          <source>Cognitive Science</source>
          <volume>33</volume>
          (
          <year>2009</year>
          )
          <fpage>51</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>