Geolocation and Named Entity Recognition in Ancient Texts: A Case Study about Ghewond's Armenian History

Marcella Tambuscio¹, Tara Lee Andrews¹,²

¹ Austrian Center for Digital Humanities and Cultural Heritage (ACDH-CH), Austrian Academy of Sciences, 1010 Vienna (Austria)
² University of Vienna, 1010 Vienna (Austria)

Abstract
We present a discussion of different methods for performing Named Entity Recognition (NER) tasks in order to extract geographic entities from the English translation of an Armenian text of the eighth century. Even though many tools are available and perform quite well on modern English, in this case they are able to detect only a very low percentage of the named geographic places. We compared four existing tools: the NLTK and spaCy Python libraries, among the most widely used for NER tasks; TagMe, an entity linking tool that annotates detected entities with Wikipedia pages; and Flair, a PyTorch library. We configured these tools to select only geographical entities, and we also tried two mixed methods: the best results on our dataset were obtained by combining the Flair and TagMe outputs with geographical clustering.

Keywords
named entity recognition, natural language processing, clustering, historical corpora

1. Introduction

As ancient and medieval texts increasingly become available in digital form, new possibilities open up for historians to perform not only traditional critical analysis, but also computationally supported analysis of their contents. Alongside this, the explosion in technological possibilities clearly presents many opportunities to enrich the text, such as adding multimodal information or automatically extracting and highlighting relevant information using natural language processing (NLP) techniques such as Named Entity Recognition (NER): identifying named entities that appear in unstructured texts and classifying them into categories (such as person, location, organization, etc.). A related but slightly more complex task is Named Entity Linking (NEL), which adds a disambiguation step. We focus here on a particular sub-task of NER/NEL: the automatic extraction of geographical names in historical corpora. Although existing tools can achieve remarkable results with modern documents in the major world languages [32, 27], ancient and medieval texts pose particular challenges due to linguistic and geographical changes that take place over time, especially where the names or boundaries of places have changed. We provide here a case study using the English translation of an Armenian text of the eighth century, the History of Ghewond, on which (despite the robust support in general for English-language texts) existing NER toolkits performed quite badly. We tested four different tools (NLTK, spaCy, TagMe, and Flair) to extract geographical names and we compared the results. The F-measures were low compared with those usually obtained on contemporary texts.

CHR 2021: Computational Humanities Research Conference, November 17–19, 2021, Amsterdam, The Netherlands
marcella.tambuscio@oeaw.ac.at (M. Tambuscio); tara.andrews@univie.ac.at (T.L. Andrews)
ORCID: 0000-0003-2097-1333 (M. Tambuscio); 0000-0001-6930-3470 (T.L. Andrews)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
We show that, by combining the outputs of the two best-performing tools, TagMe and Flair, with geographical clustering, we can substantially improve their performance. We additionally modified the output of TagMe, which links the entities in the text to their respective Wikipedia pages, so that we can use the Wikipedia API to select geographical places and directly extract their coordinates (when available) to create maps. Our main goal here is twofold: on the one hand, to compare the performance of four well-known NER toolkits on a challenging dataset (part of a larger collection of texts) and to discuss the possible reasons for the unsatisfying results; on the other hand, starting from the slightly more promising results of the toolkits trained on Wikipedia data, to explore new ways of retrieving information from this knowledge base to improve the results and lay the foundations for training a new and better-performing model in the future.

2. Related Work

Named entity recognition in historical texts is known to be a challenging problem [38], especially for geographical places that have changed name over time or ceased to exist: empirical evidence indeed suggests that the more recent the text, the more entities can be detected [13].

First, researchers have tackled the problem by developing different NER methods for identifying place references in a specific text or corpus, often enriched through the use of gazetteers. Most of these are rule-based or make use of the Stanford NER tool, and have been applied to historical newspaper collections [12, 25, 26, 33, 28], literature [9, 10], British parliamentary proceedings [19], Turkish texts [26], Arabic historical texts [6, 36], Latin corpora [14], and UK census data [30]. Similarly, some more general platforms have been created that can be applied to different datasets: the VARD tool [5] pre-processes historical corpora to propose modern equivalents alongside historical spelling variants; a digital geo-temporal gazetteer has been proposed in [29]; and the Edinburgh Geoparser [2] and Recogito [37] recognise mentions of place names in text and assist in their disambiguation with respect to existing gazetteers. On the other hand, in [20] and [18] the authors provide a comparison of several existing geoparsers that make use of gazetteers, showing that they still have several limitations.

Secondly, in the last two decades researchers have been discussing the importance of Geographic Information Systems (GIS) technologies in the pursuit of historical research [16, 4, 17] and the necessity of introducing unsupervised methods that would allow a move from rule-based systems toward more data-driven approaches [11]. In [31] the authors discuss a combination of spatial analysis and natural language processing techniques in the field of archaeology. A mixed method that combines NLP and geospatial clustering has been proposed in [23] to identify places in housing advertisements: even if the inputs here are not historical data, the challenge is similar, since many local place names either have not been registered in gazetteers or appear in abbreviated forms that gazetteers do not record.

Thirdly, some recent work has suggested that results improve when several tools are combined. In [39] the authors propose a method that combines five NER tools through a voting system, which produced a better performance than any single tool. Similarly, GeoTxt [24] is a geoparser for unstructured streaming text that supports multiple NER methods.
3. The Dataset

The History of Ghewond covers events in and around Armenia from ca. 632 to 788, focusing on the Arab domination of the region and especially the transition from Umayyad to Abbasid rule and its effects on Armenian politics and society. It is a short text (25K words) but its geographical range is broad, covering the centers of power of the Caliphate, the territories both of the former Roman Armenia and of so-called Persarmenia, as well as references to Byzantine and Khazar places.

The impetus for the study was the desire to identify and collect the context for place names in works of Armenian history, while being fully aware that NER tools for the Classical Armenian language are not in a particularly advanced state of development. The obvious solution was to analyse modern English translations of the texts. For the initial attempt we used the English translation of Ghewond's History made in 2006 by Robert Bedrosian, who is well known for his translations of several works of Armenian history. Bedrosian's translation style is to stay as close as possible to the phrasing of the original text, and in particular to render all proper names in a direct transliteration from their forms in the Armenian alphabet. While this is a welcome and helpful translation strategy from the perspective of historians of the medieval Caucasus, the place names themselves can be difficult both for untrained human readers and for neural networks to recognise. Adding to the complication is the fact that, in medieval Armenian society, territories and their ruling clans often carried the same names; this means that, for example, any occurrence of a name such as "Rshtunik" must be examined in context to determine whether it is a mention of a place or of a group of persons.

In order to compare the outputs of several NER approaches, we manually created a gold standard listing the geographical entities and their occurrences: we found 199 entities with 303 total occurrences in the text. The most frequent are Armenia, Judaea, Vaspurakan, Byzantine territory, Damascus, Byzantium, and Asorestan.

4. Methods

All the code and data can be found on GitHub.¹

4.1. NER tools

First we briefly describe the tools that we used.

NLTK (Natural Language Toolkit)² and spaCy³ are NLP libraries written in Python. They were originally developed for English and perform tasks such as tokenization, classification and part-of-speech tagging. NLTK [7, 8], developed by the University of Pennsylvania, is intended to support research, while spaCy [22], published under the MIT license, is more application-oriented. The sets of labels offered by the standard English models of the NLTK and spaCy libraries include a GPE label for geopolitical entities such as countries, cities and states, and a LOC label for physical locations such as mountain ranges, rivers and seas. We selected entities detected in our text with both labels.

Flair is a relatively new open-source library, developed in Python by the Humboldt University of Berlin and Zalando Research, that offers support for common NLP tasks including Named Entity Recognition [1]. This library seems to be more powerful than spaCy, but it must be observed that Flair is somewhat slower and is currently available for only a few languages. For our purposes we selected only entities classified with the LOC label; we sketch this selection step below.

¹ https://github.com/tambu85/ancient_text_NER
² https://www.nltk.org/
³ https://spacy.io/
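The following minimal sketch illustrates how such label-based filtering can be done with spaCy and Flair. It is an illustration under stated assumptions, not the exact pipeline from the repository above; the input filename is hypothetical.

```python
# A minimal sketch of GPE/LOC filtering with spaCy and Flair; not the exact
# pipeline from the paper's repository. The filename is hypothetical.
import spacy
from flair.data import Sentence
from flair.models import SequenceTagger

with open("ghewond_translation.txt", encoding="utf-8") as f:
    text = f.read()

# spaCy: keep only geopolitical (GPE) and physical-location (LOC) entities.
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
spacy_places = {ent.text for ent in doc.ents if ent.label_ in {"GPE", "LOC"}}

# Flair: the pre-trained English NER model, applied sentence by sentence
# (reusing spaCy's sentence segmentation); keep only LOC spans.
tagger = SequenceTagger.load("ner")
flair_places = set()
for sent in doc.sents:
    sentence = Sentence(sent.text)
    tagger.predict(sentence)
    # Depending on the Flair version, span.get_label("ner").value may be
    # needed in place of span.tag.
    flair_places |= {s.text for s in sentence.get_spans("ner") if s.tag == "LOC"}

print(sorted(spacy_places), sorted(flair_places))
```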
TagMe4 is an impressive entity linking tool, developed by the University of Pisa[15], that identifies meaningful spots in an unstructured text and links each of them to a pertinent Wikipedia page. For this reason it has been used for disambiguation tasks [35]. TagMe usually performs very well with short texts, but it can also be used on longer ones. Given an input text, the TagMe API 5 provides a list of annotations, meaning a list of pairs (spot,entity), where each spot is a substring of the input text and each entity is a reference to a unique Wikipedia page representing the meaning of that spot, in that context. TagMe computes for each entity a link probability lp that measures how frequently the spot text is used to link exactly that entity page. Moreover, TagMe associates a value ρ (rho) to each annotation, which estimates a confidence score of the annotation among the possible entities. In TagMe there is also a parameter that can be used to fine-tune the disambiguation process, either to select the most common topics for a spot or to take the context of each spot more into account (we selected this second option). This parameter could be useful when annotating particularly fragmented text, such as tweets, where it would be better to favor the most common topics because the context is less reliable for disambiguation. Supported values are floats in the range [0,0.5], default is 0.3. It should be noted that TagMe itself does not provide any sort of classification of the entities and it was not designed for this task. Nevertheless, we noticed that it was able to detect many geographical entities that the other tools missed: one reason could be that Wikipedia often reports also the ancient names of places. Then we added our own basic classification by filtering the results using a SPARQL query through the Wikidata Query Service6 to select only geographical places. Moreover, when the geographical coordinates were available in the page, we added them to the output: in the appendix we provide all the statistics about this extension of TagMe. 4.2. Mixed approaches We applied the above-mentioned tools to our historical text and evaluated the results, com- paring them with a gold standard of manual annotations: in the next section we will provide the well-known evaluation measures. None of the tools provided exceptional results, but the best ones were obtained by TagMe and Flair, even if these tools are significantly slower than spaCy and NLTK. We then tried to improve the results, proposing two other methods: • M1 we simply considered the union of the results of TagMe and Flair; • M2 we ran a geographical clustering to discard non-meaningful TagMe results and then we considered the union of these filtered results with Flair. Note that in M2 we are considering only the entities with coordinates, and among them we perform another selection with clustering. We chose DBSCAN, a popular density-based clustering algorithm, and we set the parameters (ϵ = 10, minpts = 30) following the standard procedure and computing the k-nearest neighbors (k-NN) for different values of k [21]. In section 5 we show how this is a valid approach to remove many false positives from the TagMe output. 4 https://sobigdata.d4science.org/web/tagme/tagme-help 5 An official wrapper for Python is available here: https://github.com/marcocor/tagme-python 6 https://query.wikidata.org/ 139 4.3. 
4.3. Validation and Evaluation

We selected the LOC and GPE entities detected by NLTK, spaCy and Flair, and the Wikipedia entities labeled as geographical by our SPARQL query. We then validated the results by comparing them with a gold standard that we produced manually. We had to define rules for some corner cases:

• Ethnonym references. Many entities are of the form Armenian (lords), land of the Aghuanians or Byzantine territory. We have to report that most of the time such expressions are not detected by any of the methods described (with some exceptions recognized by TagMe and Flair). We decided to count as true positives only expressions that include a geographical element, such as the country of the Armenians or Byzantine territory.

• Temporally ambiguous results. An example is Syria, for which TagMe occasionally provides a link to the modern republic, while in the text the author is referring to the Roman province; these two geographical areas overlap but are not coterminous.

In the Appendix we report the percentage of completely matching entities. In this context, in order to compare TagMe usefully with the other tools, we did not evaluate the entity linking (EL) itself, but only the NER task: an annotation is labeled as correct if the spot is indeed a location in essentially the right place. Conversely, for the geographical clustering (mixed method M2), only the entities with the correct coordinates were considered valid. Another example is the entity Iberia, which TagMe associates (in different occurrences) with two different Wikipedia pages: Iberian Peninsula and Kingdom of Iberia. For the NER task both are considered correct, since Iberia is a geographical entity in the text, while for the EL task in M2 only the second is considered correct.

To evaluate the results we used the three measures common in classification tasks: precision, recall and F1-measure [3, 34]. Here we briefly review their formulas and their meaning in our context. Our task is to find geographical entities, and we have a gold standard (the correct list) against which to compare them. For each tool we can count:

• TP (true positives), the number of entities correctly labeled as locations;
• FP (false positives), the number of entities incorrectly labeled as locations;
• FN (false negatives), the number of entities which were not labeled as locations but should have been.

From this we can compute precision and recall:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad (1)$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \quad (2)$$

These measures are not particularly informative if used individually, since a high value for one of them does not necessarily represent a good performance. Instead, we consider the F1-measure (or F-score), the harmonic mean of the two, which weights them evenly:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \quad (3)$$

The F1-measure produces values in [0, 1], where 1 represents perfect precision and recall.
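As a worked example of equations (1)–(3), the following lines compute the three measures from raw counts; the counts are hypothetical, not taken from our results.

```python
# Hypothetical counts, for illustration only.
tp, fp, fn = 120, 40, 80

precision = tp / (tp + fp)                          # eq. (1): 0.75
recall = tp / (tp + fn)                             # eq. (2): 0.60
f1 = 2 * precision * recall / (precision + recall)  # eq. (3): ~0.667
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```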
5. Results

In Table 1 we give the evaluation measures computed for the different methods that we used to detect entities defining locations in Ghewond's History.

Table 1: Precision, recall, and F1-measure for each method tested. M2, a combined approach of Flair, TagMe and geographical clustering, gives the best F1-score.

Tool     Precision   Recall   F1-measure
NLTK     0.404       0.405    0.404
spaCy    0.668       0.385    0.488
TagMe    0.537       0.473    0.503
Flair    0.772       0.677    0.721
M1       0.589       0.796    0.677
M2       0.738       0.760    0.748

The values are surprisingly low compared with the average results on well-known datasets, where these NER tools usually give F-measures that far exceed 0.9. These low values are a clear example of how the NER task can still present research challenges, especially on historical texts. Some comments on the comparison:

• NLTK is the worst-performing tool, with very low values for every measure;
• spaCy performs a little better than NLTK in terms of precision, meaning that a smaller fraction of entities is incorrectly labeled as locations (FP), but its recall is very low due to a high number of missed entities (FN);
• TagMe performs slightly better than the previous two, with similar values for precision and recall;
• Flair is the best of the four tools and exhibits the best precision score among all the proposed methods.

It is important to note that the low recall scores are partially a consequence of our choice to count as TP expressions like country / land of the Armenians, since they appear many times in the text but are only rarely captured by the tools. This is not necessarily a problem, since we could manually tune many of these tools by adding some specific context-based pattern-matching rules to detect such expressions.

While Flair clearly exhibited the best performance, it must be observed that TagMe detected many entities that were not captured by the other three, particularly places referred to by obsolete names (for instance the Roman province of Judaea, or Constantinople, the medieval name of Istanbul). This is due to the fact that TagMe links entities in the text to their Wikipedia entries, which are often reachable under the several names by which these places were known in different eras. Moreover, by adding the geographical coordinates we were able to enrich the output and, since Wikipedia is constantly growing, when repeating the experiment on Ghewond's History or other similar texts we can expect to obtain ever better results.

Since there was a significant subset of entities detected only by TagMe, our next experiment was the mixed approach M1, in which we considered the union of the outputs given by Flair and TagMe. This gave the best recall score but a lower precision than Flair by itself: this means that (as would be expected) Flair and TagMe together are less likely to miss locations, but still produce a large number of entities incorrectly labeled as places. Since Flair had a high precision, the imprecision of the M1 approach clearly originates in TagMe. To partially mitigate this problem, we made use of the coordinates returned by TagMe: we selected only the returned entities with geographical coordinates and ran a clustering algorithm to detect noise. In Figure 1 (top) we plot the validated entities with coordinates that were found within the text using TagMe and Wikipedia queries: blue circles are the true positives, while red crosses are the false positives.

Figure 1: Comparison of the set of entities with coordinates found by TagMe (top) and the result of the DBSCAN algorithm (ε = 10, minPts = 30) over the same set (bottom). Discarding the noise identified by the clustering algorithm removes a significant portion of the false positives (red) and isolates the true positives (blue).
The validation used in this case is strict: if an entity appears as a location but the coordinates are wrong, we label it as a false positive. We can observe that the true positives are quite clustered, so we ran DBSCAN, an unsupervised clustering algorithm. We remark that (as yet) we are not interested in how many or which clusters the algorithm detects; we have merely used the clustering to remove many false positives, labeled as noise. Figure 1 (bottom) shows that this method works quite well, which led us to consider only the entities within the clusters for the M2 method.

Finally we tried the second mixed approach, M2, in which we consider (as in M1) the union of the outputs given by Flair and TagMe, with the addition of the clustering step. As shown in the last row of Table 1, M2 was the best of the six approaches: even if it is not optimal, we obtained the highest F-score and a good balance between precision and recall. In the appendix we discuss the results in detail, considering the advantages and disadvantages of the various approaches.

6. Conclusions

In this paper we present a case study on the automatic detection of geographical entities in a corpus originally written in a medieval minority language. Although an English translation of the text was available, the fact that the place names in the text refer to a totally different geopolitical system is an obstacle for a system trained on modern English, meaning that "out-of-the-box" NER tools fail most of the time. Indeed, NLTK and spaCy, the most well-known tools for NER tasks, obtained very low F-measures on our corpus. Our attempts with TagMe (designed for Entity Linking) and Flair, two different tools both trained on Wikipedia data, provided better results. Although these two tools are significantly slower than NLTK and spaCy, and execution time can matter greatly for NLP tasks on streaming and real-time data, we do not consider this a major problem for a NER task run on a historical text, since it is likely to be run only once. Moreover, the modified version of TagMe that we have devised is even slower, due to the use of the Wikipedia API to get coordinates, which is also subject to a rate limit on requests. Concerning the quality of these tools' performance, we must bear in mind that, since Wikipedia is an online encyclopedia maintained by a community of volunteer editors, it continuously changes over time: this means that the results of a repeated analysis could vary (hopefully for the better), or even that larger common knowledge databases could offer alternative solutions in the future.

Finally, we tried two mixed approaches and found that by combining the Flair and TagMe results with clustering techniques we were able to significantly improve their performance. It is, however, important to note that this approach can also depend on the type of data: since we knew that most of the events described in the text happened in a circumscribed area, clustering was helpful in that it allowed us to discard some entities that were wrongly classified as places. This could also be the case for other historical texts, but more detailed research on larger datasets could provide new insights about the usefulness of geographical clustering for entities.
We see several future steps for this line of research: (i) this is an initial case study, so more tests are needed on other corpora, which could also include comparison with other tools that gave similar F1 scores on other datasets [24]; (ii) the detection process using TagMe and geographical clustering could still be improved; (iii) the mixed approach of Flair and TagMe (perhaps with additional metadata from Wikipedia) could also be used for other types of entities, such as person and organisation names.

Acknowledgments

Thanks to the developers of NLTK, spaCy, Flair and TagMe.

References

[1] A. Akbik, D. Blythe, and R. Vollgraf. "Contextual String Embeddings for Sequence Labeling". In: COLING 2018, 27th International Conference on Computational Linguistics. 2018, pp. 1638–1649.
[2] B. Alex, K. Byrne, C. Grover, and R. Tobin. "Adapting the Edinburgh geoparser for historical georeferencing". In: International Journal of Humanities and Arts Computing 9.1 (2015), pp. 15–35.
[3] R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern Information Retrieval. Vol. 463. ACM Press New York, 1999.
[4] T. J. Bailey and J. B. Schick. "Historical GIS: enabling the collision of history and geography". In: Social Science Computer Review 27.3 (2009), pp. 291–296.
[5] A. Baron and P. Rayson. "VARD2: A tool for dealing with spelling variation in historical corpora". In: Postgraduate Conference in Corpus Linguistics. 2008.
[6] M. A. Bidhendi, B. Minaei-Bidgoli, and H. Jouzi. "Extracting person names from ancient Islamic Arabic texts". In: Proceedings of the Language Resources and Evaluation for Religious Texts (LRE-Rel) Workshop Programme, Eighth International Conference on Language Resources and Evaluation (LREC 2012). 2012, pp. 1–6.
[7] S. Bird. "NLTK: the natural language toolkit". In: Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions. 2006, pp. 69–72.
[8] S. Bird, E. Klein, and E. Loper. Natural Language Processing with Python. 1st ed. O'Reilly Media, Inc., 2009.
[9] L. Borin, D. Kokkinakis, and L.-J. Olsson. "Naming the past: Named entity and animacy recognition in 19th century Swedish literature". In: Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007). 2007, pp. 1–8.
[10] J. Brooke, A. Hammond, and G. Hirst. "GutenTag: an NLP-driven tool for digital humanities research in the Project Gutenberg corpus". In: Proceedings of the Fourth Workshop on Computational Linguistics for Literature. 2015, pp. 42–47.
[11] T. Brown, J. Baldridge, M. Esteva, and W. Xu. "The substantial words are in the ground and sea: computationally linking text and geography". In: Texas Studies in Literature and Language 54.3 (2012), pp. 324–339.
[12] G. Crane and A. Jones. "The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection". In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. 2006, pp. 31–40.
[13] M. Ehrmann, G. Colavizza, Y. Rochat, and F. Kaplan. "Diachronic evaluation of NER systems on old newspapers". In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). Bochumer Linguistische Arbeitsberichte. 2016, pp. 97–107.
[14] A. Erdmann, C. Brown, B. D. Joseph, M. Janse, P. Ajaka, M. Elsner, and M.-C. de Marneffe. "Challenges and solutions for Latin named entity recognition". In: COLING 2016: 26th International Conference on Computational Linguistics. Association for Computational Linguistics. 2016, pp. 85–93.
[15] P. Ferragina and U. Scaiella. "TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities)". In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management. 2010, pp. 1625–1628.
[16] M. F. Goodchild and L. L. Hill. "Introduction to digital gazetteer research". In: International Journal of Geographical Information Science 22.10 (2008), pp. 1039–1044.
[17] I. N. Gregory and A. Hardie. "Visual GISting: bringing together corpus linguistics and Geographical Information Systems". In: Literary and Linguistic Computing 26.3 (2011), pp. 297–314.
[18] M. Gritta, M. T. Pilehvar, N. Limsopatham, and N. Collier. "What's missing in geographical parsing?" In: Language Resources and Evaluation 52.2 (2018), pp. 603–623.
[19] C. Grover, S. Givon, R. Tobin, and J. Ball. "Named Entity Recognition for Digitised Historical Texts". In: LREC. 2008.
[20] C. Grover, R. Tobin, K. Byrne, M. Woollard, J. Reid, S. Dunn, and J. Ball. "Use of the Edinburgh geoparser for georeferencing digitized historical collections". In: Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 368.1925 (2010), pp. 3875–3889.
[21] M. Hahsler, M. Piekenbrock, and D. Doran. "dbscan: Fast density-based clustering with R". In: Journal of Statistical Software 91.1 (2019), pp. 1–30.
[22] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd. spaCy: Industrial-strength Natural Language Processing in Python. 2020. doi: 10.5281/zenodo.1212303.
[23] Y. Hu, H. Mao, and G. McKenzie. "A natural language processing and geospatial clustering framework for harvesting local place names from geotagged housing advertisements". In: International Journal of Geographical Information Science 33.4 (2019), pp. 714–738.
[24] M. Karimzadeh, S. Pezanowski, A. M. MacEachren, and J. O. Wallgrün. "GeoTxt: A scalable geoparsing system for unstructured text geolocation". In: Transactions in GIS 23.1 (2019), pp. 118–136.
[25] K. Kettunen, E. Mäkelä, T. Ruokolainen, J. Kuokkala, and L. Löfberg. "Old content and modern tools: searching named entities in a Finnish OCRed historical newspaper collection 1771–1910". In: arXiv preprint arXiv:1611.02839 (2016).
[26] D. Küçük et al. "Named entity recognition experiments on Turkish texts". In: International Conference on Flexible Query Answering Systems. Springer. 2009, pp. 524–535.
[27] J. Li, A. Sun, J. Han, and C. Li. "A survey on deep learning for named entity recognition". In: IEEE Transactions on Knowledge and Data Engineering (2020).
[28] S. Mac Kim and S. Cassidy. "Finding names in Trove: named entity recognition for Australian historical newspapers". In: Proceedings of the Australasian Language Technology Association Workshop 2015. 2015, pp. 57–65.
[29] H. Manguinhas, B. Martins, and J. Borbinha. "A geo-temporal web gazetteer integrating data from multiple sources". In: 2008 Third International Conference on Digital Information Management. IEEE. 2008, pp. 146–153.
[30] P. Murrieta-Flores, A. Baron, I. Gregory, A. Hardie, and P. Rayson. "Automatically analyzing large texts in a GIS environment: The Registrar General's reports and cholera in the 19th century". In: Transactions in GIS 19.2 (2015), pp. 296–320.
[31] P. Murrieta-Flores and I. Gregory. "Further frontiers in GIS: Extending spatial analysis to textual sources in archaeology". In: Open Archaeology 1 (2015).
[32] D. Nadeau and S. Sekine. "A survey of named entity recognition and classification". In: Lingvisticae Investigationes 30.1 (2007), pp. 3–26.
[33] C. Neudecker, L. Wilms, W. J. Faber, and T. van Veen. "Large-scale refinement of digital historic newspapers with named entity recognition". In: Proc. IFLA Newspapers/GENLOC Pre-Conference Satellite Meeting. 2014.
[34] D. Powers. "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation". In: Journal of Machine Learning Technologies 2.1 (2011), pp. 37–63.
[35] M. Rovera, F. Nanni, S. P. Ponzetto, and A. Goy. "Domain-specific named entity disambiguation in historical memoirs". In: CEUR Workshop Proceedings. Vol. 2006. RWTH. 2017, Paper 20.
[36] K. Shaalan. "A survey of Arabic named entity recognition and classification". In: Computational Linguistics 40.2 (2014), pp. 469–510.
[37] R. Simon, E. Barker, L. Isaksen, and P. de Soto Cañamares. "Linking early geospatial documents, one place at a time: annotation of geographic documents with Recogito". In: e-Perimetron 10.2 (2015), pp. 49–59.
[38] S. Van Hooland, M. De Wilde, R. Verborgh, T. Steiner, and R. Van de Walle. "Exploring entity recognition and disambiguation for cultural heritage collections". In: Digital Scholarship in the Humanities 30.2 (2015), pp. 262–279.
[39] M. Won, P. Murrieta-Flores, and B. Martins. "Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora". In: Frontiers in Digital Humanities 5 (2018), p. 2.

Figure 2: Scatterplot of the entities detected by TagMe according to their values of confidence and link probability. Each point (ρ, lp) represents a detected entity with confidence ρ and link probability lp; its color discriminates true positives (blue) from false positives (red).

A. Detailed analysis of TagMe results

In the course of our work we also explored whether the values of the parameters ρ and lp can be used to tune the TagMe tool more finely: these values can be used to discard annotations that fall below a given threshold. First we visualize the true positives (TP) and false positives (FP) in a scatter plot (see Fig. 2). Even if there are no well-separated clusters, we can see that there is a higher density of FP for low values of ρ and lp. We then computed how precision, recall and F1-measure vary when moving the thresholds of ρ and lp in [0, 1]. When we fix a threshold τ, all the TP obtained for values lower than τ become FN (missed entities). Results are shown in Fig. 3 (varying one parameter at a time) and Fig. 4 (varying both parameters). As we can see, the F-measure decreases slowly at the beginning and then falls. This accords with the recommendation of the TagMe authors, who indicate values between 0.1 and 0.3 as reasonable thresholds for ρ. In our case, we set 0.1 as the threshold for ρ and 0.05 for lp (the recommended standard), but such an analysis should be repeated on other datasets to explore the role of these parameters.

Finally, in Tab. 2 we report the rates of TP associated with the correct Wikipedia entity and coordinates, obtained with a double manual validation: TagMe identified the right entity for 81% of all detected geographical entities and provided coordinates for 72% of them. Combining both, 61% of the detected entities are linked to the right Wikipedia pages and have coordinates. This could surely help in automatically providing a map of the different entities.

Figure 3: Changes in precision, recall and F1-measure when fixing a threshold on ρ (left) or on lp (right) to discard some results.
Figure 4: Heatmaps showing how precision, recall and F1-measure change when fixing thresholds on both ρ and lp to discard some results. For each point (ρᵢ, lpⱼ) we compute the measures selecting only annotations with confidence ρ ≥ ρᵢ and link probability lp ≥ lpⱼ.

Table 2: Rates of entities correctly linked in Wikipedia, entities with coordinates, and entities both correctly linked in Wikipedia and with coordinates.

Right Entity Linking   Coordinates   Right Entity Linking and Coordinates
0.81                   0.72          0.61
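For completeness, the threshold sweep behind Fig. 3 and Fig. 4 can be sketched as follows. The list of (rho, is_true_positive) pairs is a hypothetical stand-in for the validated TagMe annotations, and total_gold for the number of gold-standard occurrences; as stated above, TP discarded below the threshold are counted as FN.

```python
# Sketch of the threshold sweep of Fig. 3 (left): recompute precision, recall
# and F1 while discarding annotations whose rho falls below a threshold tau.
import numpy as np

def sweep(annotations, total_gold, taus=np.linspace(0.0, 1.0, 101)):
    """annotations: hypothetical list of (rho, is_true_positive) pairs."""
    rows = []
    for tau in taus:
        kept = [is_tp for rho, is_tp in annotations if rho >= tau]
        tp = sum(kept)
        fp = len(kept) - tp
        # TP discarded below tau become missed entities, so recall uses
        # the full gold-standard count as its denominator.
        p = tp / (tp + fp) if kept else 0.0
        r = tp / total_gold if total_gold else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        rows.append((tau, p, r, f1))
    return rows
```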