Fine-Tuning Transformers for Toponym Resolution: A Contextual Embedding Approach to Candidate Ranking

Diego Gomes¹, Ross S. Purves¹ and Michele Volpi²
¹ Department of Geography, University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
² Swiss Data Science Center, ETH Zurich and EPFL, Andreasstrasse 5, 8050 Zurich, Switzerland

GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, 2024, Glasgow, Scotland
diego.gomes@uzh.ch (D. Gomes); ross.purves@geo.uzh.ch (R. S. Purves); michele.volpi@sdsc.ethz.ch (M. Volpi)
www.geo.uzh.ch/~rsp (R. S. Purves); www.datascience.ch/people/michele-volpi (M. Volpi)
ORCID: 0009-0003-8449-2603 (D. Gomes); 0000-0002-9878-9243 (R. S. Purves); 0000-0003-2771-0750 (M. Volpi)

Abstract
We introduce a new approach to toponym resolution, leveraging transformer-based Siamese networks to disambiguate geographical references in unstructured text. Our methodology consists of two steps: the generation of location candidates using the GeoNames gazetteer, and the ranking of these candidates based on their semantic similarity to the toponym in its document context. The core of the proposed method lies in the adaptation of SentenceTransformer models, originally designed for sentence similarity tasks, to toponym resolution by fine-tuning them on geographically annotated English news article datasets (Local Global Lexicon, GeoWebNews, and TR-News). The models are used to generate contextual embeddings of both toponyms and textual representations of location candidates, which are then used to rank candidates using cosine similarity. The results suggest that the fine-tuned models outperform existing solutions in several key metrics.

Keywords
Geographic Information Retrieval, toponym resolution, transformer, gazetteer

1. Introduction

Toponym resolution, the task of assigning unique identifiers to geographical locations referred to by place names in texts, is an essential yet challenging aspect of geographic information retrieval [1]. The emergence of transformer-based models in natural language processing [2] has opened new avenues to address these challenges, providing sophisticated means to capture the nuanced relationships between textual context and geographical references.

In this paper, we present a new approach that leverages the capabilities of transformer models, specifically using the SentenceTransformers framework [3], originally designed for sentence similarity tasks. Our methodology reimagines toponym resolution as a variant of sentence similarity, comparing document-based embeddings to those generated from gazetteers to disentangle the complexities of geographical references within unstructured text.

Our approach consists of two main steps: the generation of location candidates and the ranking of these candidates based on contextual embeddings generated by fine-tuned transformer-based Siamese networks. By adapting the SentenceTransformers framework for toponym resolution, we capitalise on the ability of these pre-trained models to compare texts in a semantically meaningful way. The results of our research demonstrate that this approach outperforms recent work in three of four metrics for toponym resolution, offering a scalable, efficient, and accurate solution for the field.
2. Related Work

Recently, transformer-based methods have increasingly influenced toponym resolution methodologies. We categorise existing approaches as either localisation-based or ranking-based.

Localisation-based approaches focus on the direct prediction of geographic coordinates or areas from textual input. Radford's [4] method employs DistilRoBERTa for end-to-end probabilistic geocoding, while Cardoso et al. [5] use LSTMs with BERT embeddings to predict probability distributions over spatial regions. Similarly, Solaz and Shalumov [6] use the T5 transformer model in a sequence-to-sequence framework to translate text into hierarchical encodings of geographic cells.

Ranking-based approaches focus on the effective ranking of location candidates. Halterman's [7] Mordecai 3 system integrates spaCy transformer embeddings into a neural model that ranks candidates based on similarity measures. Li et al.'s [8] GeoLM aligns linguistic context with geospatial information through contrastive learning, enhancing language models' understanding of geographic entities. In Zhang and Bethard's [9] GeoNorm framework, a BERT-based transformer model is employed to rerank location candidates, using contextual embeddings to prioritise candidates that best match a toponym's context.

A critical limitation of many transformer-based toponym resolution methods is the lack of task-specific fine-tuning. Transformer models such as BERT [10] are pre-trained on tasks such as masked language modelling and next sentence prediction, and therefore produce embeddings optimised for those tasks. Consequently, using these embeddings directly for toponym resolution may limit their effectiveness. Studies such as those by Cardoso et al. [5] and Halterman [7] use embeddings produced by off-the-shelf transformer models without task-specific fine-tuning, potentially limiting the efficacy of these embeddings for toponym resolution. This consideration is consistent with the findings of Reimers and Gurevych [3], who found that while base models such as BERT perform poorly on sentence similarity tasks, their performance improves significantly after task-specific fine-tuning. Fine-tuning these models for toponym resolution may therefore be crucial to their ability to generate contextually relevant embeddings.

Another issue with machine learning-based toponym resolution methods in general is geographic bias, stemming from the geographic imbalance of training datasets. Trained models tend to favour locations that are overrepresented in the training corpora, as highlighted by Liu et al. [11]. The limited availability and domain diversity of geotagged datasets further exacerbates this bias [12]. Geoparsing methods should aim to generalise the toponym resolution capability acquired from training on geographically biased data to a global context, thus ensuring broad applicability and reliability of the models across different geographical regions.

3. Proposed Method

This study introduces a new method for toponym resolution, centred around the use of transformer-based Siamese networks fine-tuned to discern geographically relevant contextual cues within texts. Our approach unfolds in two key phases: the generation of location candidates and the ranking of these candidates. The first phase involves compiling potential geographical matches for identified toponyms using a gazetteer, while the second phase focuses on ranking these candidates to select the most contextually appropriate location.

Candidate generation involves querying toponyms in a toponym index created from the GeoNames database. This index contains standard and alternate location names, supplemented with externally sourced demonyms, and serves as the primary resource for retrieving location candidates. In instances where the index fails to return results using exact string matching, a fallback mechanism initiates API calls to GeoNames using both regular and fuzzy search parameters, ensuring that a list of location candidates is generated for every toponym. This fallback procedure, while simple, effectively broadens the scope of potential matches, albeit with a possible trade-off in precision.
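To illustrate, the following minimal sketch shows such a two-tier lookup. It assumes a pre-built in-memory index mapping lowercased names to candidate records; the endpoint and parameters follow the public GeoNames search API, though the exact parameter combination for fuzzy search should be verified against the GeoNames documentation.

```python
import requests

GEONAMES_URL = "http://api.geonames.org/searchJSON"

def generate_candidates(toponym, index, username="demo"):
    """Return GeoNames candidate records for a toponym.

    Tier 1 is an exact-string lookup in a local index built from the
    GeoNames database (standard names, alternate names, and demonyms);
    tier 2 falls back to GeoNames API calls with regular and fuzzy search.
    """
    candidates = index.get(toponym.lower(), [])
    if candidates:
        return candidates

    # Fallback: a regular search first, then a fuzzy one.
    for params in ({"q": toponym}, {"name": toponym, "fuzzy": 0.8}):
        params.update({"maxRows": 20, "username": username})
        response = requests.get(GEONAMES_URL, params=params, timeout=10)
        candidates = response.json().get("geonames", [])
        if candidates:
            return candidates
    return []  # should be rare, since fuzzy search broadens the match scope
```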
The candidate ranking process involves generating contextual embeddings for toponyms and each of their location candidates using a transformer-based model. Toponym embeddings are created using the toponym's source document as input to the model. To create candidate embeddings, unique textual representations are first constructed, incorporating textual descriptors of geographical identifiers such as country, administrative divisions, and feature types. These representations are then fed into the same model to obtain semantically enriched candidate embeddings. Since both toponym and candidate embeddings reside in the same vector space, cosine similarity scores can be used to rank lists of candidates and ultimately predict the most likely referent of the toponym.

Our approach aims to reduce geographic bias by emphasising contextual understanding over direct geographic associations. We hypothesise that by training models to detect and interpret geographic cues within texts, rather than learning geographic correlations, they are less likely to inherit biases from geographically skewed training datasets. By focusing on the extraction and comparison of geographically relevant contextual cues, we posit that the models develop a more generalised ability to resolve toponyms in English, less tethered to the geographic distributions present in the training data. We note, however, that this hypothesised reduction in geographic bias is an initial assumption based on the architectural design of the system. Empirical validation across diverse and globally representative datasets will be crucial to substantiate this claim and to fully assess the effectiveness of our approach in mitigating geographic bias in toponym resolution. Furthermore, we make no claims about the direct transferability of our approach to languages other than English.

At the heart of our methodology lies the adaptation of the SentenceTransformers framework [3], originally designed for sentence similarity tasks. We reimagine toponym resolution as a variant of sentence similarity, where the contextual relationship between a toponym and its geographical referent is analogous to the semantic relationship between two sentences. The SentenceTransformer models, known for their efficacy in generating semantically comparable sentence embeddings, are repurposed to generate embeddings for both toponyms and their location candidates. Using these models to encode geographical references in a comparable way harnesses their inherent strengths in understanding textual context and nuance.
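This ranking step maps directly onto the SentenceTransformers API. The following is a minimal sketch, using an off-the-shelf model and hypothetical candidate representations (the representation format we actually use is described in Section 4.2).

```python
from sentence_transformers import SentenceTransformer, util

# Off-the-shelf model shown here; in practice a fine-tuned checkpoint is used.
model = SentenceTransformer("all-distilroberta-v1")

def rank_candidates(document_text, candidate_texts):
    """Rank candidate representations by cosine similarity to the document.

    Both inputs are encoded by the same Siamese encoder, so similarities
    are computed in a shared vector space.
    """
    toponym_emb = model.encode(document_text, convert_to_tensor=True)
    candidate_embs = model.encode(candidate_texts, convert_to_tensor=True)
    scores = util.cos_sim(toponym_emb, candidate_embs)[0]
    ranked = sorted(zip(candidate_texts, scores.tolist()), key=lambda x: -x[1])
    return ranked  # ranked[0] is the predicted referent

# Example usage with hypothetical candidate representations:
doc = "Thousands marched in Paris yesterday as the strike entered its second week ..."
candidates = [
    "Paris (capital of a political entity) in Île-de-France, France",
    "Paris (seat of a second-order administrative division) in Lamar County, Texas, United States",
]
print(rank_candidates(doc, candidates)[0])
```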
To use SentenceTransformer models for toponym resolution, we fine-tune them on geographically annotated texts. This entails adapting the pre-trained models using training data that juxtaposes toponyms with both their correct and incorrect geographical matches. During this process, the models are trained to produce pairs of embeddings that are closely aligned for correct toponym-location pairs and distinctly separate for incorrect ones. This ability feeds directly into the ranking of location candidates based on their semantic similarity to the toponym as it appears in the text. By learning to generate embeddings that accurately reflect the relevant geographic information embedded in the context of the toponym, the models gain the ability to effectively discern and prioritise the most likely geographic location.

A key aspect of using the SentenceTransformers framework is the computational efficiency it brings to the methodology. Thanks to the Siamese network architecture, these models act as encoders that can process individual units of text independently. This architectural feature allows for the pre-computation of embeddings for all locations in a gazetteer, meaning that during the toponym resolution process the system only needs to generate embeddings for toponyms, significantly reducing the computational load.

4. Experiments

4.1. Datasets

In this pilot study, we used three existing annotated datasets of English news articles (Table 1): the Local Global Lexicon (LGL) [13], GeoWebNews (GWN) [14], and TR-News (TRN) [15]. The LGL dataset is heavily concentrated in the United States, with moderate coverage in Europe and the Middle East, and sparse coverage in other regions (Figure 1). The GWN dataset shows a similar pattern, but with broader European coverage and notable coverage in Africa, the Middle East, and some Asian regions. The TRN dataset, while also focused heavily on the United States, presents a more balanced distribution across Europe, the Middle East, East Asia, and Australia.

Table 1
Dataset properties

                                  LGL    GWN    TRN
Number of articles                587    199    118
Number of toponyms               4439   2401   1271
Number of unique GeoName IDs     1076    579    349

The choice of these specific datasets aligns our work with that of Zhang and Bethard [9]. By using the same datasets and mirroring the data splits into training, evaluation, and testing segments (70%, 10%, and 20%, respectively), we aim to provide a direct comparison. For training and interim evaluations, data from the three datasets were pooled, while for final testing they were kept separate, allowing performance to be assessed on each dataset individually.

Figure 1: Geographical distribution of toponyms in the three datasets

4.2. Data Preparation

Text documents containing toponyms were truncated to comply with the input sequence limits of the SentenceTransformer models, taking care to preserve the integrity of sentences and to keep the toponyms centred within the truncated text. A small number of toponyms with outdated or invalid GeoName IDs (n = 27) were removed from the datasets.

Candidate generation involved compiling a list of location candidates for each toponym. The recall rates for this process were 97.5% for LGL, 90.2% for GWN, and 98.5% for TRN. In some cases, the correct location was not included in the compiled lists, thus setting a performance ceiling for subsequent toponym resolution.

Textual representations of location candidates were constructed using attributes retrieved from each candidate's GeoNames entry, incorporating the name, feature type, and relevant administrative and geographical identifiers in a pseudo-sentence format, formulated as: "[name] ([feature type]) in [admin2], [admin1], [country]". This approach was designed to provide geographically distinct descriptors of location candidates that can be processed in the same way as text documents.
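A sketch of this construction is shown below. The field names (name, fcodeName, adminName1, adminName2, countryName) follow the GeoNames JSON schema, and the silent skipping of missing administrative levels is an illustrative choice rather than a documented part of our pipeline.

```python
def candidate_representation(record):
    """Build "[name] ([feature type]) in [admin2], [admin1], [country]".

    Missing administrative levels are skipped, so a country-level entry
    reduces to "[name] ([feature type]) in [country]".
    """
    head = f"{record['name']} ({record.get('fcodeName', 'place')})"
    parts = [record.get("adminName2"), record.get("adminName1"),
             record.get("countryName")]
    location = ", ".join(p for p in parts if p)
    return f"{head} in {location}" if location else head

record = {"name": "Paris",
          "fcodeName": "seat of a second-order administrative division",
          "adminName2": "Lamar County", "adminName1": "Texas",
          "countryName": "United States"}
print(candidate_representation(record))
# -> Paris (seat of a second-order administrative division) in Lamar County, Texas, United States
```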
To create training examples, texts containing toponyms were paired with the textual representations of location candidates. For positive training examples, the correct locations were retrieved from the toponym labels provided in the datasets. For negative examples, the toponym's list of location candidates was used to generate incorrect pairings from all items in the list, excluding the correct location. This was done without balancing the proportion of positive to negative examples.

4.3. Fine-Tuning

The fine-tuning of the SentenceTransformer models involved training on the prepared examples using a contrastive loss function. This taught the models to generate pairs of embeddings that were similar for correct toponym-location pairs and dissimilar for incorrect pairs. Models were evaluated at regular intervals during training using the separate evaluation set: they were assessed every 10% of the training steps, with the most accurate model on the evaluation set being selected for final testing. The entire training phase was completed in a single epoch, using a batch size of 8. No hyperparameter optimisation was performed.
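The following is a condensed sketch of this training setup using the SentenceTransformers fit API. The toy pairs are illustrative, and checkpoint selection via an evaluation-set accuracy evaluator every 10% of steps is omitted for brevity.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-distilroberta-v1")

# Toy pairs: (truncated document, candidate representation, label), where
# label 1 marks the correct location and 0 an incorrect one (Section 4.2).
training_pairs = [
    ("Thousands marched in Paris yesterday as the strike entered ...",
     "Paris (capital of a political entity) in Île-de-France, France", 1),
    ("Thousands marched in Paris yesterday as the strike entered ...",
     "Paris (seat of a second-order administrative division) in "
     "Lamar County, Texas, United States", 0),
]
train_examples = [InputExample(texts=[doc, cand], label=label)
                  for doc, cand, label in training_pairs]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)
# Contrastive loss pulls embeddings of correct pairs together and pushes
# embeddings of incorrect pairs apart.
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,  # a single epoch, matching our setup
)
```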
4.4. Evaluation Metrics

In line with Zhang and Bethard [9] and guided by Gritta et al. [14], we adopt four metrics to evaluate our methodology. Accuracy (A) measures the exact match rate of predicted and labelled GeoName IDs, providing a binary assessment of correctness. Accuracy@161km (A161) provides a broader perspective, assessing the proportion of toponyms that are correctly resolved within a 161 km (100 mile) radius, thus allowing for minor geographical deviations. Mean error distance (MED) quantifies the average geographical deviation of predictions from true locations. Finally, the area under the curve (AUC) assesses the error distribution, particularly accounting for outliers, by integrating the area under a curve of scaled logarithmic error distances.
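For concreteness, the sketch below computes the four metrics, assuming index-aligned lists of predicted and gold (GeoNames ID, latitude, longitude) triples. The haversine distance and the AUC normalisation by the logarithm of the maximum possible error (roughly 20,039 km, half the Earth's circumference) follow our reading of Gritta et al. [14] and should be checked against their reference implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_ERROR_KM = 20039.0  # roughly half the Earth's circumference

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def evaluate(predictions, gold):
    """predictions, gold: index-aligned lists of (geonames_id, lat, lon)."""
    n = len(predictions)
    errors = sorted(haversine_km(p[1], p[2], g[1], g[2])
                    for p, g in zip(predictions, gold))
    a = sum(p[0] == g[0] for p, g in zip(predictions, gold)) / n
    a161 = sum(e <= 161 for e in errors) / n
    med = sum(errors) / n
    # AUC: trapezoidal area under the curve of sorted log error distances,
    # normalised by the largest possible area (all errors at MAX_ERROR_KM).
    logs = [math.log(e + 1) for e in errors]
    area = sum((logs[i] + logs[i + 1]) / 2 for i in range(n - 1))
    auc = area / (math.log(MAX_ERROR_KM) * max(n - 1, 1))
    return {"A": a, "A161": a161, "MED": med, "AUC": auc}
```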
5. Results

In the evaluation of two SentenceTransformer models for toponym resolution, the base models, originally designed for sentence similarity tasks, performed better than random but generally poorly, and were outperformed by a simple population-based method (Table 2). However, a substantial increase in performance was observed after fine-tuning, with every fine-tuned model outperforming the population baseline across all datasets and metrics.

Compared to the recently introduced GeoNorm model by Zhang and Bethard [9], the fine-tuned SentenceTransformer models showed comparable or superior performance across all evaluated corpora and all metrics except mean error distance.

Table 2
Comparison of model performances on test sets using accuracy (A), accuracy at 161 km (A161), mean error distance (MED, in km), and area under the curve (AUC)

Dataset  Model                               A      A161   MED    AUC
LGL      Random                              0.229  0.278  3579   0.588
         Population                          0.650  0.732  1149   0.229
         GeoNorm                             0.799  0.828    52   0.136
         all-distilroberta-v1                0.417  0.518  1922   0.398
         all-mpnet-base-v2                   0.398  0.472  1660   0.411
         all-distilroberta-v1 (fine-tuned)   0.843  0.887   280   0.096
         all-mpnet-base-v2 (fine-tuned)      0.825  0.880   320   0.107
GWN      Random                              0.288  0.348  3585   0.551
         Population                          0.727  0.850   723   0.153
         GeoNorm                             0.832  0.876    54   0.104
         all-distilroberta-v1                0.406  0.481  2782   0.441
         all-mpnet-base-v2                   0.429  0.496  2335   0.421
         all-distilroberta-v1 (fine-tuned)   0.845  0.915   438   0.089
         all-mpnet-base-v2 (fine-tuned)      0.862  0.925   325   0.075
TRN      Random                              0.253  0.308  4209   0.594
         Population                          0.778  0.859   609   0.126
         GeoNorm                             0.897  0.911    36   0.073
         all-distilroberta-v1                0.414  0.490  3352   0.446
         all-mpnet-base-v2                   0.480  0.530  2126   0.383
         all-distilroberta-v1 (fine-tuned)   0.939  0.975    61   0.021
         all-mpnet-base-v2 (fine-tuned)      0.934  0.975    61   0.022

6. Discussion

In this study, we have demonstrated the efficacy of adapted SentenceTransformer models for toponym resolution. Our results underline the viability of this approach, with the models achieving state-of-the-art performance in the context of the datasets used. Nevertheless, it is important to note the constraints and limitations of this study. The training data, exclusively sourced from news articles, was limited in both volume and diversity. This limitation was partly intentional, to align our methodology with that of Zhang and Bethard [9] for direct comparability; however, it also reflects a broader challenge in the field: the scarcity of geographically annotated text corpora spanning diverse domains [12]. Going forward, enriching the training datasets with more numerous and varied text sources could potentially improve the models' robustness and applicability across different contexts.

Another limitation of our experimental setup was the relatively simplistic representation of location candidates with attributes sourced from GeoNames. Although the employed strategy was effective, there is substantial room for enrichment. Given the capacity of the models to process text sequences of 256-512 tokens, there is untapped potential for augmenting location descriptions. Incorporating additional information from knowledge bases or integrating spatial data, such as nearby landmarks or geographical features, could improve the models' ability to more accurately match toponyms to their geographical referents. Such an enhancement could lead to more nuanced associations between contextual cues in texts and specific location attributes, potentially increasing resolution accuracy.

Our exploration was confined to the SentenceTransformers framework, which presented both advantages and limitations. The intuitiveness of the framework and the availability of pre-trained models for sentence similarity tasks provided a solid foundation for our experiments. Nonetheless, this choice came with certain architectural constraints. In particular, the generation of embeddings via mean pooling of whole text sequences raises questions about the optimal representation of toponyms, especially in sentences containing multiple toponyms. Further experiments will be necessary to explore whether single token embeddings might be more effective when applying transformer models to the task of toponym resolution.

While our experiments were designed for comparability with the work of Zhang and Bethard [9], it is important to acknowledge the broader landscape of toponym resolution research. Hu et al. [16] provide a comprehensive overview of toponym resolution approaches, including their novel spatial clustering-based voting approach that combines several individual methods. Our method showed superior performance compared to all of these approaches on the tested datasets. However, this comparison may not be entirely fair, given that our models were trained on data sourced from the same domain used for testing. This scenario potentially provided our models with an inherent advantage over others evaluated by Hu et al. In future work we will attempt to replicate these frameworks using an out-of-domain dataset for training.

Finally, we are unsure why our models outperformed GeoNorm on all metrics except mean error distance. One possible explanation relates to the candidate matches: by including alternative names and fuzzy matching we may penalise our approach, but this discrepancy again points to the difficulties in effectively comparing toponym resolution methodologies [12].

7. Conclusion

This paper has presented a new approach to the application of transformer-based models, specifically the SentenceTransformers framework, to the task of toponym resolution. While the proposed methodology has shown promising results, achieving state-of-the-art performance, we currently view it as a proof of concept. Several elements of the methodology, such as configurations and training paradigms, are preliminary and require further research and more rigorous evaluation. As such, the true extent and applicability of this novel approach remain to be fully realised and validated.

Project repository: https://github.com/dguzh/SemTopRes

References

[1] R. S. Purves, P. Clough, C. B. Jones, M. H. Hall, V. Murdock, Geographic information retrieval: Progress and challenges in spatial search of text, Foundations and Trends® in Information Retrieval 12 (2018) 164–318. doi:10.1561/1500000034.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv (2017). doi:10.48550/arXiv.1706.03762.
[3] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[4] B. J. Radford, Regressing location on text for probabilistic geocoding, in: Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), Association for Computational Linguistics, Online, 2021, pp. 53–57. doi:10.18653/v1/2021.case-1.8.
[5] A. B. Cardoso, B. Martins, J. Estima, A novel deep learning approach using contextual embeddings for toponym resolution, ISPRS International Journal of Geo-Information 11 (2022) Article No. 28. doi:10.3390/ijgi11010028.
[6] Y. Solaz, V. Shalumov, Transformer based geocoding, arXiv (2023). doi:10.48550/arXiv.2301.01170.
[7] A. Halterman, Mordecai 3: A neural geoparser and event geocoder, arXiv (2023). doi:10.48550/arXiv.2303.13675.
[8] Z. Li, W. Zhou, Y.-Y. Chiang, M. Chen, GeoLM: Empowering language models for geospatially grounded language understanding, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 5227–5240. doi:10.18653/v1/2023.emnlp-main.317.
[9] Z. Zhang, S. Bethard, Improving toponym resolution with better candidate generation, transformer-based reranking, and two-stage resolution, in: Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 48–60. doi:10.18653/v1/2023.starsem-1.6.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, United States, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[11] Z. Liu, K. Janowicz, L. Cai, R. Zhu, G. Mai, M. Shi, Geoparsing: Solved or biased? An evaluation of geographic biases in geoparsing, AGILE: GIScience Series 3 (2022) 1–13. doi:10.5194/agile-giss-3-9-2022.
[12] M. Gritta, M. T. Pilehvar, N. Limsopatham, N. Collier, What's missing in geographical parsing?, Language Resources and Evaluation 52 (2018) 603–623. doi:10.1007/s10579-017-9385-8.
[13] M. D. Lieberman, J. Sankaranarayanan, H. Samet, Geotagging with local lexicons to build indexes for textually-specified spatial data, in: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), IEEE Computer Society, Los Alamitos, California, United States, 2010, pp. 201–212. doi:10.1109/ICDE.2010.5447903.
[14] M. Gritta, M. T. Pilehvar, N. Collier, A pragmatic guide to geoparsing evaluation, Language Resources and Evaluation 54 (2020) 683–712. doi:10.1007/s10579-019-09475-3.
[15] E. Kamalloo, D. Rafiei, A coherent unsupervised model for toponym resolution, in: Proceedings of the 2018 World Wide Web Conference, International World Wide Web Conferences Steering Committee, Geneva, Switzerland, 2018, pp. 1287–1296. doi:10.1145/3178876.3186027.
[16] X. Hu, Y. Sun, J. Kersten, Z. Zhou, F. Klan, H. Fan, How can voting mechanisms improve the robustness and generalizability of toponym disambiguation?, International Journal of Applied Earth Observation and Geoinformation 117 (2023) Article No. 103191. doi:10.1016/j.jag.2023.103191.