Enhancing Toponym Resolution with Fine-Tuned LLMs (Llama2)

Xuke Hu¹,∗, Jens Kersten¹
¹ Institute of Data Science, German Aerospace Center, Jena, 07745, Germany

Abstract
In this study, we investigate the use of mid-sized, open-source large language models to enhance the extraction of geographic information from texts, focusing on toponym resolution. Our approach involves fine-tuning Llama2 (7B) to accurately derive the unambiguous references of toponyms within textual contexts and subsequently assign geo-coordinates using geocoders. The method is evaluated on two challenging datasets featuring 28,342 global toponyms. The findings demonstrate notable performance improvements over existing state-of-the-art methods while maintaining computational efficiency.

Keywords
geoparsing, toponym resolution, large language model, Llama2

1. Introduction

Unstructured texts such as news articles, historical documents, and social media posts are rich sources of geographic information. The extraction of this information, known as geoparsing, is essential in areas such as spatial humanities [1], geographic search [2], and disaster management [3]. Geoparsing involves two key steps: toponym recognition (identifying toponyms in texts) and toponym resolution (inferring the geo-coordinates of these toponyms). While toponym recognition has advanced notably [4][5][6], toponym resolution still faces challenges in disambiguation accuracy [7]. In the rapidly evolving field of natural language processing, large language models (LLMs) such as GPT-4 have brought significant changes, also impacting research in geoparsing [8][9]. Yet, existing studies using LLMs for geoparsing focus primarily on toponym recognition. Our research, in contrast, targets the more complex sub-task of geoparsing: toponym resolution.
Specifically, we fine-tuned Llama2 (7B) [10], an open-source model with strong language comprehension and inference capabilities, to estimate toponyms' unambiguous references, which are then converted to geographical coordinates using free geocoders. Our approach demonstrates greater efficacy than several leading methods across two challenging datasets. Moreover, the approach is computationally efficient, requiring about 14 GB of memory for operation on a standard GPU.

GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, 2024, Glasgow, Scotland
∗ Corresponding author.
xuke.hu@dlr.de (X. Hu); Jens.Kersten@dlr.de (J. Kersten)
ORCID: 0000-0002-5649-0243 (X. Hu); 0000-0002-4735-7360 (J. Kersten)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. Proposed approach

Our approach, depicted in Figure 1, involves two phases: training (fine-tuning) and geocoding. Initially, we fine-tune Llama2 using Low-Rank Adaptation (LoRA) [11], a technique that optimizes GPU resource usage, to predict the unambiguous references (e.g., city, state, county) of toponyms based on their context. Our training dataset is the LGL¹ (Local-Global Lexicon) corpus, developed by Lieberman et al. [12], comprising 588 human-annotated news articles with 5,088 toponyms from 78 local newspapers. In the geocoding phase, the fine-tuned model first deduces the unambiguous reference of a toponym from its contextual cues. The deduced reference is then fed into a sequence of free geocoders: primarily GeoNames², followed by Nominatim³ and ArcGIS⁴. This sequential querying strategy consults the next geocoder whenever the previous one fails, enhancing the reliability and precision of the geocoding process.
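The sequential geocoder querying described above can be sketched as a simple fallback chain. The sketch below is illustrative only: the stub functions stand in for real GeoNames/Nominatim/ArcGIS clients (e.g., the wrappers provided by the geopy library), and their names and return values are assumptions, not the paper's actual implementation.

```python
from typing import Callable, Optional, Sequence, Tuple

Coords = Tuple[float, float]  # (latitude, longitude)

def geocode_with_fallback(
    reference: str,
    geocoders: Sequence[Callable[[str], Optional[Coords]]],
) -> Optional[Coords]:
    """Query geocoders in order (e.g., GeoNames, then Nominatim, then
    ArcGIS) and return the first successful result, or None."""
    for geocoder in geocoders:
        try:
            coords = geocoder(reference)
        except Exception:
            continue  # treat a failing service like a miss and move on
        if coords is not None:
            return coords
    return None

# Hypothetical stand-ins for real geocoding clients:
def geonames_stub(ref: str) -> Optional[Coords]:
    return None  # simulate a miss in the primary geocoder

def nominatim_stub(ref: str) -> Optional[Coords]:
    return (48.19696, -106.63671) if "Glasgow, Montana" in ref else None

coords = geocode_with_fallback(
    "Glasgow, Montana, US", [geonames_stub, nominatim_stub]
)
print(coords)  # falls through GeoNames to the Nominatim stand-in
```

The chain degrades gracefully: exceptions from an individual service are treated like misses, so a single unavailable geocoder does not abort the whole lookup.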
[Figure 1: Workflow of the proposed approach. Example: given the toponym "Glasgow" and its context ("Glasgow, a city in Mont."), the fine-tuned model estimates the unambiguous reference "Glasgow, Montana, US", which the geocoders resolve to (48.19696, -106.63671).]

3. Experiments and evaluation

3.1. Experimental setting

For LoRA, the attention dimension, the scaling parameter (alpha), and the dropout rate are set to 8, 16, and 0.1, respectively. We employed the AdamW optimizer for fine-tuning with a learning rate of 0.003, over 300 epochs, and a batch size of 128. This fine-tuning process was executed on an NVIDIA Tesla V100 GPU, utilizing about 14 GB of GPU memory. For testing, we used two public datasets, detailed in Table 1. The geographical distribution of the toponyms in the test datasets is shown in Figure 2. Our evaluation employed two metrics [13]: Accuracy@161km, the fraction of toponyms geocoded within 161 km (100 miles) of the ground truth, and Mean Error (ME), the average distance error.

¹ https://github.com/milangritta/Pragmatic-Guide-to-Geoparsing-Evaluation/blob/master/data/Corpora/lgl.xml
² https://www.geonames.org/
³ https://nominatim.org/
⁴ https://developers.arcgis.com/documentation/mapping-apis-and-services/geocoding/

[Figure 2: Geographical spread of 28,342 toponyms from the two datasets.]

We compared our approach with 10 representative methods: a voting system [7], CamCoder [14], CHF [15], Clavin⁵, Blink [16], GENRE [17], Bootleg [18], and three standard geocoders (Nominatim, GeoNames, and ArcGIS). Among these, CamCoder is a deep-learning-based geoparser; CHF and Clavin are rule-based; and Blink, GENRE, and Bootleg are deep-learning-based entity linkers. The voting system integrates seven methods, including GENRE, Blink, and CamCoder.

Table 1
Summary of the two test datasets. KB is the abbreviation of Knowledge Base.
Name             Text/Tweet Count   Toponym Count   Type           KB/Gazetteer
GeoCorpora [19]  6,648              3,100           Tweet          GeoNames
WikToR [20]      5,000              25,242          Wiki article   Wikipedia

3.2. Experimental results

The outcomes of our evaluation are presented in Table 2. The results show that our approach outperforms all others. On average, it exceeds the performance of the previously best method, the voting system, by 7% in Accuracy@161km and 61% in ME. Compared to the top individual method, GENRE, our approach demonstrates even larger improvements of 13% in Accuracy@161km and 83% in ME. These findings underscore the effectiveness of the proposed approach.

Table 2
Evaluation results on GeoCorpora and WikToR. Bold numbers indicate the best scores and the second-best scores are underlined.

                  GeoCorpora                      WikToR
Method            Accuracy@161km   ME (km)       Accuracy@161km   ME (km)
CamCoder          0.72             3506          0.67             501
CHF               0.75             2985          0.44             1264
Nominatim         0.74             1731          0.21             3894
GeoNames          0.71             3683          0.22             4179
ArcGIS            0.77             1224          0.24             3884
Clavin            0.77             2777          0.22             4171
Blink             0.75             1577          0.68             1217
GENRE             0.79             684           0.88             1006
Bootleg           0.69             4425          0.70             1483
Voting            0.84             460           0.91             273
Llama2 (7B)       0.90             247           0.98             37

⁵ https://github.com/Novetta/CLAVIN

4. Conclusion

This research presents an innovative method for toponym resolution utilizing mid-sized, open-source large language models, specifically Llama2 (7B). Its effectiveness is validated through testing on two public datasets, establishing a new standard in the field. Furthermore, it maintains computational efficiency, with a reasonable GPU memory requirement of about 14 GB. Future research will investigate a broader range of open-source LLMs for this task and conduct extensive comparative analyses with existing methods across a more diverse array of test datasets. Furthermore, efforts will be directed towards augmenting the models' geographical knowledge during inference by incorporating a toponym's candidates retrieved from gazetteers, aiming to enhance accuracy and performance further.
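For reference, the two evaluation metrics used in Section 3 can be computed as follows. This is a minimal sketch assuming coordinates are given in decimal degrees and distances are measured with the haversine formula (a common choice for geoparsing evaluation, though not necessarily the exact distance function used here); the function names are our own.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def accuracy_at_161km(predicted, gold):
    """Fraction of predictions within 161 km (100 miles) of the gold coordinates."""
    errors = [haversine_km(*p, *g) for p, g in zip(predicted, gold)]
    return sum(e <= 161 for e in errors) / len(errors)

def mean_error_km(predicted, gold):
    """Average great-circle distance error in kilometres (ME)."""
    errors = [haversine_km(*p, *g) for p, g in zip(predicted, gold)]
    return sum(errors) / len(errors)
```

For example, a prediction of (48.2, -106.6) for a gold coordinate of (48.19696, -106.63671) lies well within 161 km and therefore counts as correct under Accuracy@161km.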
Declaration of generative AI in manuscript preparation

The authors employed ChatGPT to polish the language. Following this, the manuscript underwent a thorough review and necessary modifications by the authors, who assume complete responsibility for the final content.

References

[1] I. Gregory, C. Donaldson, P. Murrieta-Flores, P. Rayson, Geoparsing, GIS, and textual analysis: current developments in spatial humanities research, International Journal of Humanities and Arts Computing 9 (2015) 1–14.
[2] R. S. Purves, P. Clough, C. B. Jones, M. H. Hall, V. Murdock, Geographic information retrieval: Progress and challenges in spatial search of text, Foundations and Trends in Information Retrieval 12 (2018) 164–318.
[3] Y. Zhang, Z. Chen, X. Zheng, N. Chen, Y. Wang, Extracting the location of flooding events in urban systems and analyzing the semantic risk using social sensing data, Journal of Hydrology 603 (2021) 127053.
[4] X. Hu, H. Al-Olimat, J. Kersten, M. Wiegmann, F. Klan, Y. Sun, H. Fan, Gazpne: Annotation-free deep learning for place name extraction from microblogs leveraging gazetteer and synthetic data by rules, International Journal of Geographical Information Science (2021) 1–28. doi:10.1080/13658816.2021.1947507.
[5] X. Hu, Z. Zhou, Y. Sun, J. Kersten, F. Klan, H. Fan, M. Wiegmann, Gazpne2: A general place name extractor for microblogs fusing gazetteers and pretrained transformer models, IEEE Internet of Things Journal (2022) 1–1. doi:10.1109/JIOT.2022.3150967.
[6] X. Hu, Z. Zhou, H. Li, Y. Hu, F. Gu, J. Kersten, H. Fan, F. Klan, Location reference recognition from texts: A survey and comparison, ACM Computing Surveys 56 (2023) 1–37.
[7] X. Hu, Y. Sun, J. Kersten, Z. Zhou, F. Klan, H. Fan, How can voting mechanisms improve the robustness and generalizability of toponym disambiguation?, International Journal of Applied Earth Observation and Geoinformation 117 (2023) 103191.
[8] G. Mai, C. Cundy, K. Choi, Y. Hu, N. Lao, S.
Ermon, Towards a foundation model for geospatial artificial intelligence (vision paper), in: Proceedings of the 30th International Conference on Advances in Geographic Information Systems, 2022, pp. 1–4.
[9] Y. Hu, G. Mai, C. Cundy, K. Choi, N. Lao, W. Liu, G. Lakhanpal, R. Z. Zhou, K. Joseph, Geo-knowledge-guided GPT models improve the extraction of location descriptions from disaster-related social media messages, International Journal of Geographical Information Science 37 (2023) 2289–2318.
[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[11] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, arXiv preprint arXiv:2106.09685 (2021).
[12] M. D. Lieberman, H. Samet, J. Sankaranarayanan, Geotagging with local lexicons to build indexes for textually-specified spatial data, in: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), IEEE, 2010, pp. 201–212.
[13] M. Gritta, M. T. Pilehvar, N. Collier, A pragmatic guide to geoparsing evaluation, Language Resources and Evaluation 54 (2020) 683–712.
[14] M. Gritta, M. Pilehvar, N. Collier, Which Melbourne? Augmenting geocoding with maps (2018).
[15] E. Kamalloo, D. Rafiei, A coherent unsupervised model for toponym resolution, in: Proceedings of the 2018 World Wide Web Conference, 2018, pp. 1287–1296.
[16] L. Wu, F. Petroni, M. Josifoski, S. Riedel, L. Zettlemoyer, Zero-shot entity linking with dense entity retrieval, in: EMNLP, 2020.
[17] N. De Cao, G. Izacard, S. Riedel, F. Petroni, Autoregressive entity retrieval, in: International Conference on Learning Representations, 2021. URL: https://openreview.net/forum?id=5k8F6UU39V.
[18] L. Orr, M. Leszczynski, S. Arora, S. Wu, N. Guha, X. Ling, C. Re, Bootleg: Chasing the tail with self-supervised named entity disambiguation, arXiv preprint arXiv:2010.10363 (2020).
[19] J. O. Wallgrün, M. Karimzadeh, A. M. MacEachren, S. Pezanowski, GeoCorpora: building a corpus to test and train microblog geoparsers, International Journal of Geographical Information Science 32 (2018) 1–29.
[20] M. Gritta, M. T. Pilehvar, N. Limsopatham, N. Collier, What's missing in geographical parsing?, Language Resources and Evaluation 52 (2018) 603–623.