<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Geolocation in LLMs: Experiments on Probing LLMs for Geographic Knowledge and Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mila Stillman</string-name>
          <email>mila.stillman@hm.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kruspe</string-name>
          <email>anna.kruspe@hm.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hochschule München</institution>,
          <addr-line>Lothstraße 64, München</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2041</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Geographic biases in Large Language Models (LLMs) are evident. Research has shown that both the training data and outputs of LLMs are skewed towards western and economically affluent countries, resulting in the underrepresentation of certain regions. In addition, LLMs are prone to hallucinations, which can lead to the generation of incorrect or fabricated information. In this paper, we present an additional analysis of the geographic knowledge and geospatial reasoning of LLMs through two experiments carried out in English on a global scale. Specifically, we evaluate the geospatial capabilities of four open-source LLMs, namely Llama 2-7B, Llama 3 (8B and 70B) and Phi-3-mini-4k, and demonstrate that geographic knowledge within LLMs is unevenly distributed across different regions of the world. This imbalance could lead to unfair treatment of certain areas and impact various applications that use geographic knowledge, including mobility and remote sensing applications that aim to use LLMs for data analysis and decision-making.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Bias</kwd>
        <kwd>Geospatial data</kwd>
        <kwd>Reasoning</kwd>
        <kwd>Fairness</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>geography and advanced planning abilities using a fixed number of stops in different countries. Here,
we test the ability of LLMs to combine those reasoning skills and analyze the effect of different starting
points on performance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Current LLMs use Transformers [20] as their backbone architecture. They are black boxes and are not
interpretable without additional components [21]. These models are trained on large textual datasets,
or training corpora. These often contain inherent selection biases, leading to models that misrepresent
both the underlying data and the real-world phenomena they aim to capture [13]. For instance,
crowdsourced geographic data on platforms such as OpenStreetMap [22] and Wikipedia [23, 24] exhibit systemic
biases and an uneven distribution of geographic information. In addition, geographic information from
geotagged social media posts is skewed mainly towards urban and wealthy regions [
        <xref ref-type="bibr" rid="ref4">25, 26, 4</xref>
        ].
      </p>
      <p>Geographic knowledge in LLMs is an increasingly studied field, especially since spatial and temporal
representations have been demonstrated in LLMs [9]. Furthermore, 18.7% of the Common Crawl
corpus used to train LLMs has been estimated to contain geospatial information such as addresses
and geocoordinates [27]. Research reveals the potential to use LLMs in applications, such as extracting
geocoordinates of cities and locations [28, 11], trip itineraries [11], and the use of GIS agents to automate
geospatial tasks [29]. However, investigations into the geographic knowledge of these models show
limitations, distance distortions, and inequality between regions [30, 14, 31, 28, 11, 32]. [33] researched
the geographic knowledge of ChatGPT by measuring its results while taking a Geographic Information
Systems (GIS) exam. The research shows that the GPT-3.5 and GPT-4 models achieve test scores of 66%
and 88.3%, respectively. Studies testing GPT-4’s capabilities in route planning and geocoded information
retrieval reveal a certain level of success on geospatial tasks. However, there are some limitations in
abstract reasoning, suggesting that memorization could play a role in task performance [32, 11]. [34]
shows that LLMs demonstrate confidence in simple route planning tasks using cognitive maps; however,
the authors suggest that this confidence is likely attributed to memorized routes rather than a genuine
understanding of cognitive maps, route planning strategies or inference capabilities. Furthermore, the
authors indicate that LLMs tend to fail due to hallucinations, constructing overly long routes in some cases,
or getting stuck in loops. The authors in [35] tested the factuality of LLMs on a global scale using data
from the World Bank and found significant biases for countries with lower income levels and
in certain regions. This paper is a continuation of the work on identifying and quantifying existing
geographic biases in LLMs. Future mobility applications, such as route planning and navigation, as well
as the customization of LLMs to different geographic locations, require precise geographic knowledge.
Moreover, any potential bias or discriminatory treatment of certain locations by LLMs could pose harm
to individuals and therefore requires a careful examination.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our experimental setup includes probing and comparing the output of four LLMs for extraction of
georeferenced information. In the first experiment, instead of asking LLMs to indicate the geocoordinates
of known locations, e.g., cities or points of interest, as has been done in previous research, we conduct
a reverse geocoding experiment by selecting geocoordinates in a randomized manner and ask the
LLMs to ’guess’ their location. This task is not common in texts; however, intelligent models that
have an embedded geographic component should be able to make these educated guesses, similarly to
human geographers. The models are downloaded from Hugging Face [36], namely Llama 2 [37] with
7B parameters, Llama 3 [38] with 8B and 70B parameters, and Phi-3-mini [39] with a 4k-token context window. The
use of the four models is subject to their respective license agreements. The probing experiments are
conducted using the transformers library from Hugging Face [40]. We choose open-source models
due to the ability to run these locally without additional API costs. All experiments are run using two
Nvidia RTX A6000 GPUs. The exact models used are as follows: Llama-2-7b-chat-hf,
Meta-Llama-3.1-8B-Instruct, Meta-Llama-3.3-70B-Instruct, and Phi-3-mini-4k-instruct. For Llama 2, the chat model was
selected due to its additional human-feedback training, which improved its performance on benchmark
datasets compared to the original model. For the other LLMs, the instruction-tuned models are used due to
their improved performance on common industry benchmarks compared to base or chat models. The
prompts are adjusted to fit the geographic context via a system prompt. Programming is excluded in
this task, since models tend to use code to download and process geospatial data. Instead, to extract
potential inherent biases, we probe the implicit geographic knowledge that LLMs already possess. The
answer ’not available’ is acceptable when the LLMs are unable to provide an answer. The prompt is
adjusted to fit the template provided by each model, while the prompt text remains the same. The
number of tokens is limited to 1000. The temperature values are the default values: for Llama 2 and
Llama 3 models the temperature value is 0.6. For Phi-3 the temperature was set to 0.0 as suggested
for inference in the model’s card on Hugging Face. The Llama 3-70B model was quantized to 8-bit for
memory optimization using bitsandbytes from transformers [40]. The text of the system prompt is as
follows:
Your role is an expert geographer who is familiar with the geocoordinate system and world geography. Provide short and concise answers to the question you are asked. Programming is not allowed. If you do not know the answer, answer with ’not available’.</p>
      <p>In both experiments, we used the 177 countries from the naturalearth_lowres dataset using
the Geopandas [41] library in Python. In the first experiment, for each country, we generate
random uniformly distributed geocoordinates within its bounding box. We use polygon data from the
'naturalearth_lowres' dataset to keep only points that fall on land within the countries’ polygons,
removing any in the oceans, until reaching exactly 20 points per country. For each pair of generated
geocoordinates, we ask the LLMs to indicate the country to which they belong. The text of the user
prompt is provided below:
You will be given a set of geocoordinates in the form of (lat, long).</p>
      <p>Provide the country to which these geocoordinates belong. First, identify the country name, then provide only the country code in ISO 3166 format.</p>
      <p>Here are the geocoordinates:</p>
      <p>In both experiments, the LLMs are asked to identify the country name and provide the country code
in ISO-3166 format. This approach to prompting allowed us to improve robustness and avoid mismatches
in country names caused by abbreviations and other toponyms.</p>
      <p>In the second experiment, we examine the ability of LLMs to plan routes on a global scale. We
prompt the LLMs to plan a trip itinerary around the world. In other words, define a route that will
circumnavigate the Earth and return to the starting point, by using any means of travel. Curating such
a trip requires reasoning capabilities, understanding geography on a global scale, applying a circular
shape to the route, while comprehending that the Earth is spherical. To assess potential biases in this
task, we ran this experiment using a different starting point each time, that is, starting from each of the
177 countries in the dataset. The number of stops was limited to 12 countries. The LLMs are also asked
to provide the geocoordinates at each stop. To prevent ambiguities associated with the phrase “around
the world”, which could imply visiting various locations globally without necessarily completing a full
circle around the Earth, the experiment is repeated with revised wording, changing the phrase “a trip
around the world” to “a trip circumnavigating the Earth”. An example of the desired geocoordinate
format was provided to encourage its adoption, due to the models’ tendency to provide various formats
of geocoordinates. Given the example, this tendency was reduced, but not fully eliminated. In this
experiment, the user prompt is as follows:</p>
      <p>In both experiments, the prompt is designed to make geospatial predictions in a zero-shot manner.
Multiple prompt variations were explored to achieve concise results that convey the necessary
information. Further improvements in the form of prompt engineering are left for future work. The code is
available at github.com/Milast/geo_biases_llms.</p>
      <p>During post-processing, the pycountry library is used to identify country codes in ISO-3166 format
within the free text given by the LLMs. The rate of missing or incorrect values in the country output
of the LLMs in the first experiment is less than 2% for all models. We consider this error
rate acceptable when working with generated text. This is partially caused by the option given to the
models to answer with ’not available’, as well as due to the LLMs providing false or no country codes
in rare cases. The country Kosovo was manually added since it does not have an ISO-3166 code. The
geocoordinates generated in the second experiment in some cases did not match the coordinates of the
provided country or were beyond the limits of longitude and latitude values. Further inspection and
analysis of false geocoding is left for future work.</p>
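      <p>The code-extraction step can be sketched as below; the toy code table stands in for the pycountry lookup actually used, with Kosovo’s provisional code added by hand to mirror the manual fix described above:</p>
      <preformat>
```python
import re

# Minimal stand-in for the pycountry database: a few ISO 3166-1 alpha-2 codes.
# "XK" (Kosovo) is added manually, as it has no official ISO-3166 entry.
ISO_CODES = {"DE": "Germany", "IT": "Italy", "ZA": "South Africa", "XK": "Kosovo"}

def extract_country_code(llm_text):
    """Scan free-form LLM output for a known two-letter country code; return
    None when no valid code is found (counted as 'missing' in the evaluation)."""
    for token in re.findall(r"\b[A-Z]{2}\b", llm_text):
        if token in ISO_CODES:
            return token
    return None
```
      </preformat>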
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <sec id="sec-4-1">
        <title>4.1. First experiment</title>
        <p>First, we analyze whether the LLMs are able to correctly identify the countries. When asking LLMs for
both country name and country code, all models had a high preference towards providing an answer
rather than answering with the option ’not available’. The percentage of missing answers due to the
optional ’not available’ answer or incorrect country codes is provided in Table 1. The Llama 2 model
exhibited the most biased behavior, as well as proved to be the least knowledgeable. The continents
with the best performance are Europe and North America, followed by Oceania and South America.
Asia had a relatively low score, and all models performed the worst in recognizing countries in Africa.
This is in line with previous research that studied distorted distances and geographic biases in this
region. Both Antarctica and Seven Seas (open ocean) continents in the dataset have a single country
each and demonstrated low accuracies. This is expected since those are remote areas that most likely
do not appear often in the training data of LLMs. Surprisingly, the larger model of Llama 3 with 70B
parameters performed worse in this task than the Llama 3 with 8B parameters and the much smaller
Phi-3-mini-4k model. As a post-processing step, we analyze whether the country with which the LLMs
respond belongs to the same continent as the correct country. Here, we notice a better performance
for all continents. The Phi-3-mini-4k and Llama 3-8B models demonstrate high accuracies in Africa
compared to the other two models. This suggests that LLMs do possess a certain understanding of
geolocation on a continent level. The results for the three models are presented in Figure 1, Figure 2,
and Table 1.</p>
        <p>The Llama 2 model has the lowest diversity of countries given as output and the lowest number
of countries which were correctly identified at least once. Interestingly, the Llama 3 model with 70B
parameters has a lower diversity of countries than both the Llama 3-8B and the Phi-3 model. For
Llama 2, countries such as Indonesia, South Africa, Turkey and Germany are frequently used. In Llama
3-8B, the most frequently used countries were South Africa and Mozambique, and the Llama 3-70B
model frequently used China and Italy. Surprisingly, the Phi-3-mini-4k model most frequently used
South Africa, Turkey, Ghana and Kenya.</p>
        <p>(Figure 1: number of correctly identified countries per model; Figure 2: number of correctly geocoded countries per model, for Llama 2-7B, Llama 3-8B, Llama 3-70B and Phi-3-mini-4k.)</p>
        <p>The top 20 countries and their frequency of occurrence for each model are presented in Table 3
in Appendix A. We also analyze the relationship between the frequency of countries provided by the
LLMs and their Gross Domestic Product (GDP) and population estimate. The Llama 3-8B and the
Phi-3 models show a weak correlation with population estimate, while both Llama 3-70B and Llama
2-7B show a moderate correlation. The correlation with GDP is most prominent in the Llama 3-70B
model. We calculate the Pearson correlation coefficient and find significant p-values for both GDP and
population estimate in all models except Phi-3, with p-values of 0.047 and 0.075 for GDP and population
estimate, respectively. The calculated Pearson correlation coefficients and the corresponding p-values
are summarized in Table 2. Further analysis of correlation with travel data or other training data was
left for future work.</p>
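        <p>The correlation check described above uses the standard Pearson coefficient, which can be computed as in this minimal sketch (the function is a plain-Python stand-in for a library routine; the data in the test is illustrative, not the paper’s):</p>
        <preformat>
```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# e.g. country mention counts vs. GDP values (toy numbers)
r = pearson_r([3, 10, 25, 40], [1.2, 2.9, 7.5, 14.0])
```
        </preformat>
        <p>In practice the significance test (the reported p-values) would come from a statistics package such as scipy.stats.pearsonr rather than being computed by hand.</p>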
        <p>Figure 3 indicates the GDP and population estimate as a function of how often they were mentioned
by the LLMs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Second experiment</title>
        <p>In the second experiment, the LLMs generate trip itineraries for a trip around the world. We calculate
and analyze the average total distance traveled using the Haversine Distance formula [42] and find that
it is much longer in the Llama 3-70B and Phi-3 models, with an average distance traveled resembling
the Earth’s circumference of around 40,000 km or higher. In comparison, the Llama 2 and Llama 3-8B
models’ average distances were close to 30,000 km or shorter. Moreover, the choice of wording, or
lexical choice, has an effect on the average travel distance. For example, in the Phi-3 model the distance
is shorter with the word ’circumnavigate’, while for Llama 2 and Llama 3-8B it is longer. For Llama 2
and Phi 3, the average distance traveled from African countries is the shortest, followed by Asia and
Europe. The distance traveled by Llama 3-70B is robust to changes in language and to starting countries
on different continents.</p>
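        <p>The distance computation relies on the Haversine formula [42]; a straightforward implementation, together with an illustrative helper that sums an itinerary’s legs (the helper name is ours, not from the released code), looks as follows:</p>
        <preformat>
```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    R = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def itinerary_length_km(stops):
    """Total distance of an itinerary given as a list of (lat, lon) stops."""
    return sum(haversine_km(*stops[i], *stops[i + 1]) for i in range(len(stops) - 1))
```
        </preformat>
        <p>An itinerary whose total length approaches the Earth’s circumference (about 40,000 km) is consistent with a genuine circumnavigation, whereas much shorter totals suggest a regional round trip.</p>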
        <p>Furthermore, the percentage of countries chosen for the trip around the world that belong to the
same continent as that of the starting country is found to be significantly higher in Africa for Phi-3 and
Llama 2, irrespective of the syntactic variation and without repeating countries. In terms of the shape of
the trips, the qualitative results suggest that Llama 3-70B can best capture a rounded trip. Examples of
trip shapes can be found in Appendix B. The average distance traveled and the percentage of stops
within the same continent as the starting point are presented in Figure 4 and Figure 5, respectively.</p>
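        <p>The same-continent percentage can be computed with a simple helper like the one below; the toy continent lookup is illustrative only, whereas the paper derives continent membership from the naturalearth_lowres attributes:</p>
        <preformat>
```python
def same_continent_share(stops, continent_of, start_country):
    """Share of itinerary stops lying on the starting country's continent."""
    home = continent_of[start_country]
    on_home = sum(1 for c in stops if continent_of.get(c) == home)
    return on_home / len(stops)

# Toy lookup standing in for the naturalearth_lowres 'continent' column
continent_of = {"Kenya": "Africa", "Ghana": "Africa", "Italy": "Europe", "China": "Asia"}
share = same_continent_share(["Ghana", "Italy", "China", "Kenya"], continent_of, "Kenya")
```
        </preformat>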
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Some of the inaccuracies generated by LLMs are expected, due to the training data of LLMs, which
includes geographic data (e.g., from Wikipedia, social media, etc.) that are potentially skewed towards
western and more affluent parts of society. Although LLMs might not possess, or be able to interpret,
polygon information of countries to perform reverse geocoding accurately, the models did attempt to
guess the country in most cases, even when given an option to answer with ’not available’. This effect
could partially be due to the choice of models, prompts, and model parameters, which could be further
optimized in the future. Surprisingly, the Llama 2 model has a small number of selected countries,
which could be the result of the difference in the size of the training data compared to the other models.
The Phi-3-mini-4k, a much smaller model, has competitive performance in terms of accuracy and bias.
Surprisingly, the least biased model was Llama 3-8B, which even performed better than the Llama 3-70B
model in the first experiment. This indicates that perhaps a larger model is not necessarily superior to
smaller models in the geographic information extraction task. However, in the second experiment the
larger model proved to be more robust to lexical choice and provided more circular routes than the
other models, suggesting improved geospatial reasoning capabilities. The Phi-3 model generated longer
trips too; however, it was less robust to formulation and formed trip shapes that are more chaotic. Some
potential explanations for less accurate routes could include travel restrictions to certain countries
and not enough training data from people traveling around the world from these regions. Another
interesting result is that while toponym resolution remains a challenge in GeoAI, specifically when
probing LLMs for geographic knowledge, we found that unifying country names using country codes
worked well. Nevertheless, the geocoordinates provided by the LLMs had different formats and required
manual work and resolution. In some cases, they were beyond the geographic limits or misrepresented
the corresponding country. The optimization, verification and standardization of the output of LLMs
in the geographic context could be a future research direction.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Large Language Models exhibit biases, including geographic biases. In this paper, we demonstrate that
the geographic knowledge of LLMs is partially inaccurate and biased against less economically strong
regions of the world. We conducted experiments with four LLMs to assess their ability to perform
two geospatial tasks. We find that the representation of objective geographic knowledge is unequal
between regions and that some countries are overrepresented. Future work includes testing more LLMs,
using multilingual prompts and chain-of-thought prompts with provided polygon information, a larger
amount of generated geocoordinates, and prompts using different geographic granularity. In addition,
spatial fairness is an important aspect in avoiding common pitfalls [43]. Since random geocoordinates
are used in the first experiment, it is crucial to conduct a thorough analysis of these locations. For
instance, in densely populated areas, random geocoordinates may provide more information compared
to those in rural or sparsely populated regions. Furthermore, conducting correlation analyses with data
sources such as worldwide travel data and other sources of LLMs’ training data could be beneficial.
Finally, bias in LLMs affects a variety of emerging socially critical applications, e.g., in human resources,
journalism, and education [44]. Furthermore, integrating geolocation smoothly with LLMs for remote
sensing and mobility applications would require high accuracy and trustworthiness. Uncovering such
biases and knowledge gaps is a critical first step towards improving the explainability of these models
and for the development of future solutions.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT in order to: Improve writing style of
some of the sentences. Further, the author(s) used ChatGPT to generate code for data analysis functions,
which were used as a template and modified to fit the relevant data and tasks. After using these
tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility
for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[9] W. Gurnee, M. Tegmark, Language models represent space and time, arXiv preprint
arXiv:2310.02207 (2023).
[10] R. Manvi, S. Khanna, G. Mai, M. Burke, D. Lobell, S. Ermon, Geollm: Extracting geospatial
knowledge from large language models, arXiv preprint arXiv:2310.06213 (2023).
[11] J. Roberts, T. Lüddecke, S. Das, K. Han, S. Albanie, Gpt4geo: How a language model sees the
world’s geography, arXiv preprint arXiv:2306.00020 (2023).
[12] K. Salmas, D.-A. Pantazi, M. Koubarakis, Extracting Geographic Knowledge from Large Language
Models: An Experiment, in: KBC-LM’23: Knowledge Base Construction from Pre-trained Language
Models workshop at ISWC 2023, 2023.
[13] R. Navigli, S. Conia, B. Ross, Biases in large language models: origins, inventory, and discussion,</p>
      <p>ACM Journal of Data and Information Quality 15 (2023) 1–21.
[14] J. Dunn, B. Adams, H. T. Madabushi, Pre-trained language models represent some geographic
populations better than others, arXiv preprint arXiv:2403.11025 (2024).
[15] A. Kruspe, Towards detecting unanticipated bias in large language models, arXiv preprint
arXiv:2404.02650 (2024).
[16] R. Manvi, S. Khanna, M. Burke, D. Lobell, S. Ermon, Large language models are geographically
biased, arXiv preprint arXiv:2402.02680 (2024).
[17] A. Kruspe, M. Stillman, Saxony-Anhalt is the worst: Bias towards german federal states in
large language models, in: German Conference on Artificial Intelligence (Künstliche Intelligenz),
Springer, 2024, pp. 160–174.
[18] S. Mirza, B. Coelho, Y. Cui, C. Pöpper, D. McCoy, Global-Liar: Factuality of LLMs over time and
geographic regions, arXiv preprint arXiv:2401.17839 (2024).
[19] F. Faisal, A. Anastasopoulos, Geographic and geopolitical biases of language models, arXiv
preprint arXiv:2212.10408 (2022).
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,</p>
      <p>Attention is all you need, Advances in neural information processing systems 30 (2017).
[21] E. Cambria, L. Malandri, F. Mercorio, N. Nobani, A. Seveso, Xai meets llms: A survey of the relation
between explainable ai and large language models, arXiv preprint arXiv:2407.15248 (2024).
[22] J. Thebault-Spieker, B. Hecht, L. Terveen, Geographic Biases are ’Born, not Made’ Exploring
Contributors’ Spatiotemporal Behavior in OpenStreetMap, in: Proceedings of the 2018 ACM
International Conference on Supporting Group Work, 2018, pp. 71–82.
[23] C. Hube, Bias in wikipedia, in: Proceedings of the 26th International Conference on World Wide</p>
      <p>Web Companion, 2017, pp. 717–721.
[24] M. Graham, B. Hogan, R. K. Straumann, A. Medhat, Uneven geographies of user-generated
information: Patterns of increasing informational poverty, Annals of the Association of American
Geographers 104 (2014) 746–764.
[25] B. Hecht, M. Stephens, A tale of cities: Urban biases in volunteered geographic information, in:
Proceedings of the international AAAI conference on Web and Social Media, volume 8, 2014, pp.
197–205.
[26] L. Li, M. F. Goodchild, B. Xu, Spatial, temporal, and socioeconomic patterns in the use of Twitter
and Flickr, Cartography and geographic information science 40 (2013) 61–77.
[27] I. Ilyankou, M. Wang, S. Cavazzi, J. Haworth, Quantifying geospatial in the common crawl
corpus, in: Proceedings of the 32nd ACM International Conference on Advances in Geographic
Information Systems, 2024, pp. 585–588.
[28] P. Bhandari, A. Anastasopoulos, D. Pfoser, Are large language models geospatially knowledgeable?,
in: Proceedings of the 31st ACM International Conference on Advances in Geographic Information
Systems, 2023, pp. 1–4.
[29] Z. Li, H. Ning, Autonomous GIS: the next-generation AI-powered GIS, Int. J. Digit. Earth 16 (2023)
4668–4686.
[30] R. Decoupes, R. Interdonato, M. Roche, M. Teisseire, S. Valentin, Evaluation of Geographical
Distortions in Language Models: A Crucial Step Towards Equitable Representations, arXiv preprint
arXiv:2404.17401 (2024).
[31] P. Schwöbel, J. Golebiowski, M. Donini, C. Archambeau, D. Pruthi, Geographical erasure in
language generation, arXiv preprint arXiv:2310.14777 (2023).
[32] S. Das, Evaluating the Capabilities of Large Language Models for Spatial and Situational
Understanding, Ph.D. thesis, Thesis (MA). University of Cambridge, 2023.
[33] P. Mooney, W. Cui, B. Guan, L. Juhász, Towards understanding the geospatial skills of chatgpt:
Taking a geographic information systems (gis) exam, in: Proceedings of the 6th ACM SIGSPATIAL
International Workshop on AI for Geographic Knowledge Discovery, 2023, pp. 85–94.
[34] I. Momennejad, H. Hasanbeig, F. Vieira Frujeri, H. Sharma, N. Jojic, H. Palangi, R. Ness, J. Larson,
Evaluating cognitive maps and planning in large language models with cogeval, in: A. Oh, T.
Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information
Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 69736–69751. URL: https://proceedings.
neurips.cc/paper_files/paper/2023/file/dc9d5dcf3e86b83e137bad367227c8ca-Paper-Conference.pdf.
[35] M. Moayeri, E. Tabassi, S. Feizi, Worldbench: Quantifying geographic disparities in llm factual
recall, in: The 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024, pp.
1211–1228.
[36] S. M. Jain, Hugging face, in: Introduction to transformers for NLP: With the hugging face library
and models to solve problems, Apress Berkeley, CA, 2022, pp. 51–67.
[37] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint
arXiv:2307.09288 (2023).
[38] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang,</p>
      <p>A. Fan, et al., The llama 3 herd of models, arXiv preprint arXiv:2407.21783 (2024).
[39] M. Abdin, S. A. Jacobs, A. A. Awan, J. Aneja, A. Awadallah, H. Awadalla, N. Bach, A. Bahree,
A. Bakhtiari, H. Behl, et al., Phi-3 technical report: A highly capable language model locally on
your phone, arXiv preprint arXiv:2404.14219 (2024).
[40] T. Wolf, Transformers: State-of-the-art natural language processing, arXiv preprint
arXiv:1910.03771 (2020).
[41] K. Jordahl, J. V. den Bossche, M. Fleischmann, J. Wasserman, J. McBride, J. Gerard, J. Tratner,
M. Perry, A. G. Badaracco, C. Farmer, G. A. Hjelle, A. D. Snow, M. Cochran, S. Gillies, L. Culbertson,
M. Bartos, N. Eubank, maxalbert, A. Bilogur, S. Rey, C. Ren, D. Arribas-Bel, L. Wasser, L. J. Wolf,
M. Journois, J. Wilson, A. Greenhall, C. Holdgraf, Filipe, F. Leblanc, geopandas/geopandas: v0.8.1,
2020. URL: https://doi.org/10.5281/zenodo.3946761. doi:10.5281/zenodo.3946761.
[42] C. C. Robusto, The cosine-haversine formula, The American Mathematical Monthly 64 (1957)
38–40.
[43] S. Shaham, G. Ghinita, C. Shahabi, Models and mechanisms for spatial data fairness, in: Proceedings
of the VLDB Endowment. International Conference on Very Large Data Bases, volume 16, NIH
Public Access, 2022, p. 167.
[44] C. Filippo, G. Vito, S. Irene, B. Simone, F. Gualtiero, Future applications of generative large
language models: A data-driven case study on chatgpt, Technovation 133 (2024) 103002.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Frequency of use of countries by the LLMs</title>
      <p>B. Round-the-World Trip examples</p>
      <p>(Trip-shape panels: (c) Llama 3-8B, “trip around the world”; (e) Llama 3-70B, “trip around the world”; (g) Phi-3-mini-4k, “trip around the world”.)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lobry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Marcos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Murray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <article-title>RSVQA: Visual question answering for remote sensing data</article-title>
          ,
          <source>IEEE Transactions on Geoscience and Remote Sensing</source>
          <volume>58</volume>
          (
          <year>2020</year>
          )
          <fpage>8555</fpage>
          -
          <lpage>8566</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>CityGPT: Empowering urban spatial cognition of large language models</article-title>
          (
          <year>2024</year>
          ). URL: http://arxiv.org/abs/2406.13948. doi:10.48550/arXiv.2406.13948, arXiv:2406.13948 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mostafavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gharaibeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Abedin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <article-title>VictimFinder: Harvesting rescue requests in disaster response from social media with BERT</article-title>
          ,
          <source>Computers, Environment and Urban Systems</source>
          <volume>95</volume>
          (
          <year>2022</year>
          )
          <fpage>101824</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X. X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kochupillai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Werner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Häberle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Taubenböck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tuia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Levering</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kruspe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Abdulahhad</surname>
          </string-name>
          ,
          <article-title>Geoinformation harvesting from social media data: A community remote sensing approach</article-title>
          ,
          <source>IEEE Geoscience and Remote Sensing Magazine</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>150</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>McKenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bhaduri</surname>
          </string-name>
          ,
          <article-title>GeoAI: Spatially explicit artificial intelligence techniques for geographic knowledge discovery and beyond</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ermon</surname>
          </string-name>
          ,
          <article-title>Towards a foundation model for geospatial artificial intelligence (vision paper)</article-title>
          ,
          <source>in: Proceedings of the 30th International Conference on Advances in Geographic Information Systems</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <source>Handbook of geospatial artificial intelligence</source>
          , CRC Press, Boca Raton,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Louwerse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Zwaan</surname>
          </string-name>
          ,
          <article-title>Language encodes geographical information</article-title>
          ,
          <source>Cognitive Science</source>
          <volume>33</volume>
          (
          <year>2009</year>
          )
          <fpage>51</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>