=Paper=
{{Paper
|id=Vol-3683/CEUR-Template-2col1
|storemode=property
|title=Where is the News? Improving Toponym Identification and Differentiation in Online News
|pdfUrl=https://ceur-ws.org/Vol-3683/paper5.pdf
|volume=Vol-3683
|authors=Joseph Shingleton,Ana Basiri
|dblpUrl=https://dblp.org/rec/conf/ecir/ShingletonB24
}}
==Where is the News? Improving Toponym Identification and Differentiation in Online News==
Joseph Shingleton1,∗, Ana Basiri1,2

1 School of Geographical and Earth Sciences, The University of Glasgow, United Kingdom
2 The Alan Turing Institute, London, United Kingdom

Abstract

Understanding the geographical context of unstructured textual data is a key challenge in information extraction. In many applications, however, simple identification of toponyms is insufficient and can often lead to ambiguities in the extracted information. One such application is the geolocation of online news, where a single article may mention multiple locations, with only one location referring to the article's subject. In this paper, we present a transformer-based model trained to identify the subject toponym of news articles. Further, our model identifies likely parents of the subject toponym, potentially helping to improve later geolocation tasks. Our model is able to identify the subject toponym of an article with an F1-score of 0.760 when tested on a human-tagged test dataset.

Keywords

Natural language processing, Geoparsing, Toponym identification, Transformer models

1. Introduction

Accurate extraction of geographical information from natural language relies on the ability to reliably identify toponyms within text, and to associate those toponyms with unique geographic locations [1, 2, 3]. Modern toponym extraction methods, such as named entity recognition, have been shown to be highly effective in this task [4, 5, 6]. In most cases, however, these tools are limited to applications in which a geographic location is expected to be assigned to each toponym in a piece of text. For many applications this may be perfectly acceptable, and even desirable. However, some applications require a more nuanced approach to toponym identification, such as geotagging of news articles [7, 8].
GeoExT 2024: Second International Workshop on Geographic Information Extraction from Texts at ECIR 2024, March 24, 2024, Glasgow, Scotland. ∗Corresponding author: joseph.shingleton@glasgow.ac.uk (J. Shingleton); ana.basiri@glasgow.ac.uk (A. Basiri). ORCID: 0000-0002-1628-3231 (J. Shingleton), 0000-0002-2399-1797 (A. Basiri). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Often, the subject of a news article is associated with a single geographic location referred to in the text. Frequently, however, other toponyms will appear alongside the subject toponym, either as a way to help geographically identify the subject, or due to some interaction between the subject location and other named locations. For example, in the sentence "Two firefighters have travelled almost 4,000 miles from the USA to confirm their vows at the Calton Community Fire Station in Glasgow", it is clear that the subject location of this article is Calton Community Fire Station. The article also mentions the toponyms Glasgow and USA. In this context, the toponym Glasgow is used to geographically identify the subject, as it is a parent location of the subject. The toponym USA is mentioned due to an interaction between it and the subject toponym within the narrative of the article. Such toponyms can be considered incidental to the subject.

In this paper, we present a transformer-based model which can identify the subject location of news articles, and differentiate between parent and incidental locations to aid in precise geolocation. The model is trained on data scraped from the BBC Monitoring website [9], an online news platform which collects news from around the world on topics covering terrorism, conflict, misinformation and political extremism.

Various techniques exist for simple toponym recognition in news articles.
Lieberman and Samet developed a model which combines a rule-based approach to toponym identification with a statistical named entity recognition model, achieving a precision of 0.739 and recall of 0.868 on a corpus of 11,564 news articles, outperforming many models existing at the time [10]. Modern transformer-based named entity recognition models, such as Topo-BERT, have been shown to be highly effective in toponym recognition without the need for additional rule-based techniques. The Topo-BERT model achieved an average precision of 0.827 and recall of 0.886 across a range of news and social media data sources [4].

Monteiro et al. provide a detailed survey of articles investigating the geographic scope of documents [8]. In this context, geographic scope refers to the identification of a single location, or multiple locations, which provide a broad geographic representation of all (or most) toponyms in a document. This differs from subject toponym identification, which aims to assign a single toponym (or multiple toponyms) in the text as the geographic subject of the document. For example, the sentence "Firefighters from Motherwell and Edinburgh were called in to help fight the fire in Glasgow" might have the geographic scope of Scotland, as it links the three named toponyms, despite the subject toponym of the article being Glasgow. Monteiro et al. allude to this through the identification of geographic semantic scope as an area for future research, in which the semantic meaning of the document is considered alongside explicitly mentioned toponyms.

Previous approaches to subject-location identification tend to rely on heuristic models [7]. Such approaches use syntactical and contextual clues, such as a toponym's occurrence in a headline, its position and/or frequency within the text, or the relative prominence of an associated location.
These approaches, however, cannot account for all of the grammatical nuances within natural language which help to identify subject locations, and are unable to identify spatial relationships between locations. Further, rule-based approaches may suffer from reduced generalizability, as domain-specific language can lead to rules failing to translate across different types of text [8, 11].

Modern transformer-based language models may help to address the poor generalizability of heuristic approaches. A recent paper by Tahmasebzadeh et al. [12] implemented a BERT-based transformer model to identify the subject location of news articles. The implementation, however, limits the model to predicting locations from a pre-defined pool, reducing its utility in real-world geoparsing applications. Our model improves on the utility of this approach by classifying toponyms within the text, allowing the model to make predictions on previously unseen locations.

In this paper, we propose a transformer-based model which is able to identify the subject toponym of a news article and differentiate secondary toponyms in terms of their spatial relationship with the subject. To do this, we use a simple heuristic model to automatically tag a dataset of news articles. This noisy data is used to train a transformer-based model, before fine-tuning the model on a smaller, manually annotated dataset. The process of fine-tuning, along with the use of a relatively noise-robust transformer model, helps to alleviate some of the noise introduced by the heuristic tagging method.

2. Methods

The aim of this paper is to develop a Topo-BERT model trained on the task of subject toponym identification. To achieve this, we first need to construct a dataset of news articles with appropriate toponym tags.
This process consists of three steps: accessing and downloading relevant news articles; generic identification of toponyms within articles via named entity recognition; and subsequent re-categorisation of the identified toponyms in terms of their relationship with an article's subject. The constructed dataset can then be used to train a Topo-BERT model on the task of subject identification and toponym differentiation.

2.1. Collecting news data

We use the BBC Monitoring API [9] to access news articles. In order to link each article to a specific location, we search the API for articles by headline. We use GeoNames [13] to construct a list of 14,696 global cities with population greater than 100,000, and search for any BBC Monitoring articles which mention each city in the headline. The coordinates of each city are also recorded. Capital cities are removed from the list to avoid metonymic use of capital cities as a reference to a country's government. In cases where a single name refers to two or more locations, the cities with the smaller populations are removed from the list. Headlines which mention more than one place name are also removed from our dataset.

Geographic information, including coordinates and (if available) bounding boxes or spatial polygons, associated with each city is obtained by querying OpenStreetMap's Nominatim API [14] with each city name. If multiple matches are found, then we select only the match which contains the known coordinates of the city and which has the highest OpenStreetMap importance score, since a higher importance is associated with higher populations.

This process does introduce some noise into the training data. In particular, we cannot be certain that the place referred to in the headline is always the article's subject location. For example, articles referencing the Nagoya Protocol may be identified as having Nagoya, Japan as a subject.
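The match-selection rule just described can be sketched as follows. This is an illustrative assumption, not the authors' code: the field names mirror Nominatim's JSON output, where `boundingbox` is a `[south, north, west, east]` list of strings and `importance` is a relevance score.

```python
# Hypothetical sketch of the Nominatim match-selection rule: among candidate
# matches for a city name, keep only those whose bounding box contains the
# known GeoNames coordinates, then take the one with the highest importance.

def contains(bbox, lat, lon):
    """Nominatim bounding boxes are [south, north, west, east] strings."""
    south, north, west, east = map(float, bbox)
    return south <= lat <= north and west <= lon <= east

def select_match(candidates, lat, lon):
    plausible = [c for c in candidates if contains(c["boundingbox"], lat, lon)]
    if not plausible:
        return None
    return max(plausible, key=lambda c: c["importance"])

# Toy example: two places named "Springfield"; only one contains the known point.
candidates = [
    {"display_name": "Springfield, Illinois", "importance": 0.61,
     "boundingbox": ["39.6", "39.9", "-89.8", "-89.5"]},
    {"display_name": "Springfield, Missouri", "importance": 0.58,
     "boundingbox": ["37.0", "37.3", "-93.5", "-93.2"]},
]
best = select_match(candidates, lat=39.78, lon=-89.65)
```

In a real pipeline the candidate list would come from a Nominatim `/search` request; here it is mocked so the selection rule itself can be inspected in isolation.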
Further, articles referring to places with ambiguous names may result in erroneous geographic information collected from OpenStreetMap. Using the city with the highest population goes some way to addressing this, as more news articles are written about places with high populations [15].

2.2. Toponym identification in news articles

For our initial toponym identification, we train a Topo-BERT model to perform NER tagging on the WikiNEuRal [16] and CoNLL-2003 [17] datasets. Our Topo-BERT model is constructed from a large, cased BERT model, which outputs into a one-dimensional convolutional layer with 16 nodes, connected to a max-pooling layer. The output of this layer is passed into a fully connected layer with 512 nodes, before finally passing through a softmax-activated output layer [4]. We train the model over 20 epochs, using a weighted masked categorical loss function and an Adam optimizer. The loss function is weighted to help account for the class imbalance in the dataset [18].

2.3. Relational tagging of news articles

All toponyms identified in the previous step are again searched for using the Nominatim API, and the spatial information (points, polygons, or bounding boxes) of all matches is stored. If the toponym cannot be found using the Nominatim API, then a further search is completed using the GeoNames API, yielding only point information. The extracted geographical information is used to identify spatial relationships between the subject toponym and other toponyms in the text. If the subject location is contained within any polygon (or bounding box) associated with a secondary toponym (across multiple possible matches), then the secondary toponym is tagged as a parent of the subject location. If the subject location contains any polygon (or point) associated with a secondary toponym, then we assume that the more geographically specific location is the actual subject of the article.
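These containment rules can be illustrated with a small sketch. It is a simplification under stated assumptions: the real geometries are polygons and bounding boxes returned by Nominatim, whereas here each location is reduced to a (point, bounding box) pair, and the coordinates for the firefighter example from the introduction are approximate, illustrative values rather than real query results.

```python
# Hedged sketch of the relational-tagging rules: bounding boxes stand in for
# Nominatim polygons, and all coordinates are illustrative approximations.

def inside(point, bbox):
    """bbox = (min_lat, max_lat, min_lon, max_lon)."""
    lat, lon = point
    min_lat, max_lat, min_lon, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def tag_toponyms(subject_name, locations):
    """locations maps name -> (point, bbox); subject_name must be a key.
    A secondary toponym whose box contains the subject point is a 'parent';
    a secondary toponym whose point falls inside the subject box becomes the
    new, more specific subject (the old subject is reassigned as a parent);
    everything else is 'incidental'."""
    sub_point, sub_bbox = locations[subject_name]
    tags = {subject_name: "subject"}
    for name, (point, bbox) in locations.items():
        if name == subject_name:
            continue
        if inside(sub_point, bbox):
            tags[name] = "parent"
        elif inside(point, sub_bbox):
            tags[name] = "subject"
            tags[subject_name] = "parent"
            subject_name, sub_point, sub_bbox = name, point, bbox
        else:
            tags[name] = "incidental"
    return tags

# Firefighter example, with approximate (lat, lon) points and bounding boxes.
locations = {
    "Glasgow": ((55.86, -4.25), (55.78, 55.93, -4.39, -4.07)),
    "Scotland": ((56.49, -4.20), (54.63, 60.86, -8.65, -0.73)),
    "Calton Community Fire Station": ((55.85, -4.22),
                                      (55.849, 55.851, -4.221, -4.219)),
    "USA": ((39.83, -98.58), (24.52, 49.38, -124.77, -66.95)),
}
tags = tag_toponyms("Glasgow", locations)
```

With these toy geometries the fire station is promoted to subject, Glasgow and Scotland are tagged as parents, and USA is tagged as incidental, matching the behaviour described in the text.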
Hence, the secondary toponym is tagged as the subject toponym, and the previously identified subject toponym is reassigned as a parent. By doing this, we ensure that the subject toponym relates to the most geographically specific location in the text. Any locations which cannot be found using either OpenStreetMap or GeoNames, or which have no relationship with the subject under the specified rules, are tagged as incidental locations. This is a common source of noise in the dataset, as there may be locations which have a parent/child relationship with the subject but which cannot be found using the API tools. In such cases, the tagging method will introduce false negatives into the training data.

To address this and other sources of labelling noise, we manually label a subset of 1,343 sentences from the training data and use these to fine-tune the model in a final training step. This allows the model to retain some of the relationships learned during the initial noisy training step, while correcting for some of the inaccuracies introduced [19]. For both the noisy training step and the fine-tuning step we use the same model as described in the previous subsection. The model is again trained for 20 epochs in each case. The model which achieves the lowest weighted loss on a validation set (a random sample of 10% of our training set) is saved. As this work serves as a proof of concept, we have not performed any hyper-parameter tuning and do not consider our results to be optimal.

3. Results

We use a test set of 200 human-tagged articles to assess the accuracy of the heuristic tagging method and the Topo-BERT model. For the Topo-BERT model, we present the accuracy after training on just the noisy data, just the fine-tuning data, and the noisy data plus the fine-tuning data. The recall, precision, and F1 score for each toponym type are given in Table 1.
Table 1: Accuracy of the heuristic tagging method and the Topo-BERT model with and without fine-tuning.

Model                     Label       Precision  Recall  F1
Heuristic                 Subject     0.834      0.636   0.728
                          Parent      0.798      0.634   0.707
                          Incidental  0.468      0.961   0.629
Noisy data only           Subject     0.807      0.687   0.742
                          Parent      0.687      0.602   0.642
                          Incidental  0.410      0.669   0.509
Fine-tuning data only     Subject     0.712      0.768   0.739
                          Parent      0.637      0.623   0.630
                          Incidental  0.489      0.433   0.459
Noisy data + fine-tuning  Subject     0.814      0.713   0.760
                          Parent      0.669      0.768   0.715
                          Incidental  0.462      0.646   0.539

The heuristic model achieves an F1 score of 0.728 when identifying subject toponyms and 0.707 when identifying parent toponyms. The precision of the model is generally very good in the subject category (0.834), with much of the error coming from poor recall (0.636), indicating a low false positive rate and a high false negative rate. A similar pattern is observed in the parent category (F1: 0.707, precision: 0.798, recall: 0.634), indicating that many of the false negatives might be misidentified as belonging to the incidental class. This is expected due to the limitations of the OpenStreetMap database used to perform the heuristic tagging.

Allowing the model to first train on the noisy data, before fine-tuning on the high-quality data, improves the accuracy of the model. The final model outperforms the heuristic model on subject toponym identification (F1: 0.760) and parent toponym identification (F1: 0.715). Identification of incidental toponyms is reduced, however. Figure 1 shows that 17% of the toponyms tagged as incidental by the human reviewer are tagged as subject toponyms by the model, and 26% of model-tagged incidental locations are tagged as subject locations by the human reviewer. The inability to differentiate between subject and incidental toponyms is likely due to the limited size of the human-tagged fine-tuning data. By increasing the number of high-quality samples available in this step we will likely see improvements in the model.
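The per-class scores reported above are standard token-level precision, recall, and F1. A minimal sketch of how such scores are computed from paired human/model tag sequences (the paper's evaluation code is not published; the tag values and data here are illustrative):

```python
# Illustrative per-class precision/recall/F1 over paired (human, model) tags,
# as used for Table 1 style scores. Data and tag names are made up.

def prf(human, model, label):
    """Token-level precision, recall and F1 for one class label."""
    pairs = list(zip(human, model))
    tp = sum(h == label and m == label for h, m in pairs)
    fp = sum(h != label and m == label for h, m in pairs)
    fn = sum(h == label and m != label for h, m in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

human = ["subject", "parent", "incidental", "subject", "incidental"]
model = ["subject", "parent", "subject",    "subject", "incidental"]
p, r, f1 = prf(human, model, "subject")
```

On this toy example the "subject" class has one false positive and no false negatives, so precision is 2/3, recall is 1.0, and F1 is 0.8.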
4. Discussion

The methods developed in this paper provide a promising indication of the capacity for transformer-based models in the geoparsing of news media. Existing approaches to subject toponym identification tend to rely on purely heuristic models [7], similar to the initial noisy tagging model presented in our work. Such models use specific structural features within text to make predictions. For many articles, however, such structural information (such as a name within a headline, or an expected order or frequency of toponyms) may be missing, leading to misleading results and reduced generalizability. A trained machine learning model, however, can use grammatical indicators within the text which may be missed by rule-based approaches [20, 21, 22]. Because of this, the Topo-BERT model benefits from wider generalizability to a more diverse set of problems compared to the heuristic model. As such, the utility of the model exceeds the marginal improvements on identification of subject and parent toponyms. Further improvements to model accuracy may be achieved through hyper-parameter tuning, use of noise-robust loss functions, or through increasing the size of the manually annotated fine-tuning dataset.

Figure 1: (a) the proportion of Topo-BERT model guesses in the subject (Sub), parent (Par) and incidental (Inc) classes, given the human assigned tag; (b) the proportion of human assigned tags in each category given the model prediction.

(a) Human tag (rows) vs model tag (columns):
      Sub   Par   Inc
Sub   0.71  0.10  0.17
Par   0.10  0.77  0.13
Inc   0.18  0.10  0.65

(b) Human tag (rows) vs model prediction (columns):
      Sub   Par   Inc
Sub   0.81  0.18  0.26
Par   0.05  0.67  0.07
Inc   0.10  0.12  0.46

Our model performs well on a human-tagged test set, correctly identifying the subject toponym in 71% of cases. Differences in testing data and performance metrics mean that it is difficult to draw comparisons with existing models.
The CLIFF-CLAVIN model [7] achieved an accuracy of 74.1% when identifying the subject country of an article, but has not been tested on city-level extraction. A more recent transformer-based model [12] achieved accuracies of 48.1% and 53.4% when predicting the city and region of focus, respectively. Our model appears to outperform this; however, this should be validated by testing both models on the same test data.

A further benefit of our model is its capacity to identify spatial relationships between the subject toponym and other toponyms in the text. This provides more spatial information for later geocoding steps and may improve geocoding accuracy; however, this remains to be demonstrated fully. Other spatial relationships which aid in disambiguation have not been considered at this stage. Further work may try to differentiate between toponyms through identification of locations which are near the subject, or which have a shared parental lineage. Including these more complex geographical relationships may improve the accuracy of later disambiguation methods [8].

A more nuanced approach to noise handling will likely further improve model accuracy. Approaches which attempt to identify mislabeled data [23], or establish robust classification boundaries [24], can help to reduce the effect of labelling noise on transformer-based models. Further work will aim to apply these noise-handling techniques to our model to improve classification accuracy.

Acknowledgments

The authors acknowledge support from the UK Research and Innovation (UKRI) Future Leaders Fellowship on "Indicative Data", MR/S01795X/2, and the Alan Turing Institute-DSO partnership project on "Multi-Lingual and Multi-Modal Location Information Extraction".

References

[1] S. E. Middleton, G. Kordopatis-Zilos, S. Papadopoulos, Y. Kompatsiaris, Location extraction from social media: Geoparsing, location disambiguation, and geotagging, ACM Transactions on Information Systems 36 (2018).
URL: https://doi.org/10.1145/3202662. doi:10.1145/3202662.
[2] M. Gritta, M. Pilehvar, N. Collier, A pragmatic guide to geoparsing evaluation, Language Resources and Evaluation 54 (2019) 683–712. doi:10.1007/s10579-019-09475-3.
[3] M. Karimzadeh, S. Pezanowski, A. M. MacEachren, J. O. Wallgrün, GeoTxt: A scalable geoparsing system for unstructured text geolocation, Transactions in GIS 23 (2019) 118–136. URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/tgis.12510. doi:10.1111/tgis.12510.
[4] B. Zhou, L. Zou, Y. Hu, Y. Qiang, D. Goldberg, TopoBERT: a plug and play toponym recognition module harnessing fine-tuned BERT, International Journal of Digital Earth 16 (2023) 3045–3064. URL: https://doi.org/10.1080/17538947.2023.2239794. doi:10.1080/17538947.2023.2239794.
[5] C. Berragan, A. Singleton, A. Calafiore, J. Morley, Transformer based named entity recognition for place name extraction from unstructured text, International Journal of Geographical Information Science 37 (2023) 747–766. URL: https://doi.org/10.1080/13658816.2022.2133125. doi:10.1080/13658816.2022.2133125.
[6] L. Tao, Z. Xie, D. Xu, K. Ma, Q. Qiu, S. Pan, B. Huang, Geographic named entity recognition by employing natural language processing and an improved BERT model, ISPRS International Journal of Geo-Information 11 (2022). URL: https://www.mdpi.com/2220-9964/11/12/598. doi:10.3390/ijgi11120598.
[7] C. D'Ignazio, R. Bhargava, E. Zuckerman, L. Beck, CLIFF-CLAVIN: Determining geographic focus for news articles, in: Proceedings of NewsKDD: Data Science for News Publishing, 2014.
[8] B. R. Monteiro, C. A. Davis, F. Fonseca, A survey on the geographic scope of textual documents, Computers & Geosciences 96 (2016) 23–34. URL: https://www.sciencedirect.com/science/article/pii/S0098300416301972.
doi:10.1016/j.cageo.2016.07.017.
[9] British Broadcasting Corporation, BBC Monitoring, https://monitoring.bbc.co.uk/, 2024.
[10] M. D. Lieberman, H. Samet, Multifaceted toponym recognition for streaming news, in: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 843–852. URL: https://doi.org/10.1145/2009916.2010029. doi:10.1145/2009916.2010029.
[11] X. Li, W. Zhang, Y. Wang, Y. Tan, J. Xia, Spatio-temporal information extraction and geoparsing for public Chinese resumes, ISPRS International Journal of Geo-Information 12 (2023). URL: https://www.mdpi.com/2220-9964/12/9/377. doi:10.3390/ijgi12090377.
[12] G. Tahmasebzadeh, E. Müller-Budack, S. Hakimov, R. Ewerth, MM-Locate-News: Multimodal focus location estimation in news, in: MultiMedia Modeling - 29th International Conference, MMM 2023, Bergen, Norway, January 9-12, 2023, Proceedings, Part I, volume 13833 of Lecture Notes in Computer Science, Springer, 2023, pp. 204–216. URL: https://doi.org/10.1007/978-3-031-27077-2_16. doi:10.1007/978-3-031-27077-2_16.
[13] GeoNames, GeoNames, http://geonames.org/, 2024.
[14] OpenStreetMap contributors, Planet dump retrieved from https://planet.osm.org, https://www.openstreetmap.org, 2017.
[15] E. Avraham, Cities and their news media images, Cities 17 (2000) 363–370. URL: https://www.sciencedirect.com/science/article/pii/S0264275100000329. doi:10.1016/S0264-2751(00)00032-9.
[16] S. Tedeschi, V. Maiorca, N. Campolungo, F. Cecconi, R. Navigli, WikiNEuRal: Combined neural and knowledge-based silver data creation for multilingual NER, in: Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, Punta Cana, Dominican Republic, 2021, pp. 2521–2533. URL: https://aclanthology.org/2021.findings-emnlp.215.
[17] E. F. Tjong Kim Sang, F.
De Meulder, Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition, in: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, 2003, pp. 142–147. URL: https://www.aclweb.org/anthology/W03-0419.
[18] G. King, L. Zeng, Logistic regression in rare events data, Political Analysis 9 (2001) 137–163. doi:10.1093/oxfordjournals.pan.a004868.
[19] S. Ahn, S. Kim, J. Ko, S.-Y. Yun, Fine tuning pre-trained models for robustness under noisy labels, 2023. arXiv:2310.17668.
[20] V. Nastase, P. Merlo, Grammatical information in BERT sentence embeddings as two-dimensional arrays, in: Proceedings of the 8th Workshop on Representation Learning for NLP (RepL4NLP 2023), 2023, pp. 22–39. URL: https://aclanthology.org/2023.repl4nlp-1.3. doi:10.18653/v1/2023.repl4nlp-1.3.
[21] G. Jawahar, B. Sagot, D. Seddah, What does BERT learn about the structure of language?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2019, pp. 3651–3657. URL: https://aclanthology.org/P19-1356. doi:10.18653/v1/P19-1356.
[22] H. J. Shin, J. Y. Park, D. B. Yuk, J. S. Lee, BERT-based spatial information extraction, in: Proceedings of the Third International Workshop on Spatial Language Understanding, 2020, pp. 10–17. URL: https://aclanthology.org/2020.splu-1.2. doi:10.18653/v1/2020.splu-1.2.
[23] S. Wang, Z. Tan, R. Guo, J. Li, Noise-robust fine-tuning of pretrained language models via external guidance, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 12528–12540. URL: https://aclanthology.org/2023.findings-emnlp.834. doi:10.18653/v1/2023.findings-emnlp.834.
[24] R. Liu, S. Mo, J. Niu, S. Fan, CETA: A consensus enhanced training approach for denoising in distantly supervised relation extraction, in: N.
Calzolari, C.-R. Huang, H. Kim, J. Pustejovsky, L. Wanner, K.-S. Choi, P.-M. Ryu, H.-H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, S.-H. Na (Eds.), Proceedings of the 29th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Gyeongju, Republic of Korea, 2022, pp. 2247–2258. URL: https://aclanthology.org/2022.coling-1.197.