T. Riechert, F. Beretta, G. Bruseker (Eds.): RODBH 2019, Proceedings of the Doctoral Symposium on Research on Online Databases in History 2019, p. 42

Automated Geo-resolution of Place Names in Historical Serial Sources

Erik Radisch¹

Abstract: This paper presents the historic place name locator², an algorithm for the automated geo-resolution of place names in historical serial sources. The approach takes historical boundaries into account and can handle spelling variations. It can thus help not only to store the content of a source in a database but also to make its historical meaning accessible.

Keywords: Historical Place Names; Geo-resolution; GIS; Metadata Enrichment

1 Introduction

One of the biggest challenges in building and maintaining semantic databases for historical topics is not only to store the content of the sources, but also to make their meaning accessible. This problem is also present in other semantic databases, yet it takes on a larger dimension with historical topics. Different spellings, the use of outdated terms and the loss of "domain knowledge" (place names that have been forgotten) make the meaning much harder to access. This paper presents an approach to automating the access to an important information carrier in historical serial sources: place names or, to be more precise, their geo-resolution.

Automated geo-resolution of historic place names has already been addressed by several approaches. Yet those approaches were either highly specialized for very particular problems and cannot be generalized, as for example the solution of Schürer et al. [SPS15]. Their solution for automated geo-resolution is highly convincing, yet it was coded specifically for their very particular three-level place name source (parish, county and country).
Or they sacrificed potential domain knowledge for the sake of generalization, as in the case of the Edinburgh Geoparser. This geoparser offers only rudimentary means of including the historic context of place names, namely in the form of a bounding box (a bounding box of the German Empire would also include the whole of Bohemia and a large part of Poland).

¹ Saxon Academy of Sciences and Humanities in Leipzig, Germany, radisch@saw-leipzig.de
² https://github.com/erikradisch/historic-place-name-locator

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

This paper presents an algorithm that tries to overcome these limitations. The historic place name locator³ enables the automated geo-resolution of large numbers of place names. This is achieved by three main features of the program: a collated gazetteer, a custom search algorithm and the inclusion of the historical context.

2 Gazetteers

There is no doubt that the choice of gazetteer can fundamentally influence the search result: search results can only ever be as good as the chosen gazetteer. Some global gazetteers such as Geonames reach an impressive coverage, yet they often have gaps concerning deserted towns or historic names. For these cases, specialized gazetteers like the Historic Gazetteer (GOV) can offer better coverage, yet such gazetteers do not come close to the overall coverage of global ones. Thus, to achieve high generalizability, it is necessary to combine several different gazetteers. The historic place name locator does exactly this. The most important sources are Geonames, the Historic Gazetteer (GOV), Wikidata and Open Street Map (OSM)⁴.

3 Search Algorithm

Historic place names often appear in sources with spelling variations.
Such outdated name variants have very little chance of appearing in current gazetteers. It is therefore very important to implement a search algorithm that can also deal with spelling variations. The historic place name locator implements a complex search routine consisting of three different similarity search algorithms. Two of them are approximate string matching algorithms: the Damerau-Levenshtein distance [Da64] and the Jaro-Winkler distance [Wi90]. The third is a phonetic algorithm, for which the user can choose between Cologne Phonetics [Po69] and Double Metaphone [Ph00]. Together, these three algorithms enable the historic place name locator to even out spelling differences between historic sources and current gazetteers.

³ The algorithm and a manual on how to use it can be found at the following URL: https://github.com/erikradisch/historic-place-name-locator
⁴ So far: Wikidata: https://www.wikidata.org/ [retrieved 07.01.2019], the Historic Gazetteer: www.gov.genealogy.net [retrieved 01.06.2019], Geonames: https://www.geonames.org/ [retrieved 20.03.2019] and Osmnames, a gazetteer based on Open Street Map: https://osmnames.org/ [retrieved 01.10.2019]. The combination of these databases into a common one is documented here: https://github.com/erikradisch/historic-place-name-locator/tree/master/make-place-name-db

4 Historical Context

The last feature of the historic place name locator is its ability to include historic boundaries in the search. Searching within a historical context was already performed by the two previous geo-resolution approaches discussed in the introduction. Yet the algorithm of Schürer et al. depends heavily on a corresponding gazetteer that carries this additional metadata. The overwhelming majority of historical projects do not have this advantage. The Edinburgh Geoparser, on the other hand, does allow the search to be restricted to a specific bounding box, yet such boxes are very inaccurate.
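The inaccuracy of a bounding box compared with an actual boundary polygon can be illustrated with a minimal sketch. It is not taken from the historic place name locator: the L-shaped REGION, the two candidate points and all function names are hypothetical and chosen for illustration only, and the polygon test is a plain ray-casting check, which is an assumption and not necessarily the routine the program itself uses.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def bounding_box(polygon: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def in_bounding_box(point: Point, polygon: List[Point]) -> bool:
    """Coarse test: is the point inside the polygon's bounding box?"""
    min_x, min_y, max_x, max_y = bounding_box(polygon)
    x, y = point
    return min_x <= x <= max_x and min_y <= y <= max_y


def in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Exact test by ray casting: count how many polygon edges a ray
    running from the point towards +x crosses; an odd count means inside."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical L-shaped historic region (not real coordinates).
REGION: List[Point] = [
    (0.0, 0.0), (4.0, 0.0), (4.0, 1.0),
    (1.0, 1.0), (1.0, 4.0), (0.0, 4.0),
]

candidate_a: Point = (0.5, 3.0)  # inside the region
candidate_b: Point = (3.0, 3.0)  # in the "notch": inside the box only

for name, candidate in (("a", candidate_a), ("b", candidate_b)):
    print(name, in_bounding_box(candidate, REGION), in_polygon(candidate, REGION))
# prints:
# a True True
# b True False
```

Here candidate_b passes the bounding-box test although it lies outside the region. This is the kind of false hit that a bounding box of the German Empire would admit for places in Bohemia.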
The historic place name locator solves this problem by including shape files of historic boundaries in the constructed gazetteer. There is a constantly growing number of professional shape files of historic boundaries that are available under open access. Examples are a map of all regions of Europe around 1900 (Mosaic⁵), the borders of the states of the German Empire (Mosaic, Harvard Geospatial Library⁶), the Empire and Kingdom of Austria-Hungary (Mosaic) and the Russian Empire (Ristat⁷). The user only needs to connect each historic place name to the corresponding region in the shape file by providing a second column with the name of the region as used in the shape file (needless to say, this step unfortunately becomes labor-intensive if many places have different historic contexts). The algorithm then favors results from this region.

If the algorithm does not find a place in the historic region, it is possible to expand the search area step by step. For example, if a place is not found in a historic region such as Hessen-Nassau, the user can let the algorithm search in the German Empire in a second step and in the whole world only in a third step.

Including a historical context can greatly improve the accuracy of georeferencing historic place names, as it helps to exclude candidate hits that are unlikely due to their location. Because historic boundaries are very often highly complex, an automated search algorithm has a real advantage here: for humans it is often hard to say where exactly one historic region ended and another began.

⁵ https://ehps-net.eu/databases/mosaic-project [retrieved 10.11.2019]
⁶ http://hgl.harvard.edu:8080/opengeoportal/ [retrieved 20.03.2019]
⁷ https://ristat.org/ [retrieved 10.05.2019]

5 Evaluation Mode

The program also has an evaluation mode, which enables the user to compare the matches to a gold standard.
The algorithm produces HTML files for differing results, each showing a map with the historic border (if given), the gold standard (green) and the result of the algorithm (red), as can be seen in Figure 1.

Fig. 1: A sample of the output of differing results. Note that the algorithm might have been right in this case, as the hit is directly within the borders of Hessen-Nassau while the gold standard is only close by. (Source of the geospatial data: Germany Provincial Boundaries, 1871, German Historical GIS, online linkage: http://hgl.harvard.edu:8080/HGL/jsp/HGL.jsp?action=VColl&VCollName=GHGIS1914PROVINCES; Open Street Map, online linkage: https://www.openstreetmap.org)

6 Conclusion

The historic place name locator still demands a considerable amount of preprocessing. Cleaning of the place names might still be necessary. In addition, the historic context of each place name has to be assigned to a polygon in a shape file. Nevertheless, the historic place name locator offers a generalizable solution for the geo-resolution of place names in serial sources that also considers the exact historical context. By including the historical context of the place names, the algorithm achieves considerable accuracy. Several tests on a gold standard of around 500 manually located place names from several different sources produced F-scores close to 0.9.

However, the good results should not distract from the fact that some problems remain despite the high allocation rate. The step-by-step search, for example, can also produce problematic hits. The following example illustrates this susceptibility to error: A list of place names is given whose historical contexts are several different regions of the German Empire. To keep the error rate low, it might be a good idea to include a second search step within the borders of the German Empire if a place was not found within the borders of the assigned historical context.
If the search algorithm looks first in one part of the German Empire, for example East Prussia, and does not find an exact match, it looks in the next step within the borders of the whole German Empire and might find a match in the Breisgau, which is close to France. A place with exactly the same name that lies very close to East Prussia, yet in the Russian Empire, would thereby be excluded from the search. A human might consider the place close to East Prussia much more plausible than the place close to France. Expanding the search area in circles or bounding boxes might be more precise. The implementation of such an option is planned but not yet realized.

References

[Da64] Damerau, F.: A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3):171–176, 1964.

[Ph00] Phillips, L.: The Double Metaphone Search Algorithm. In: Dr Dobb's, June 1, 2000. https://www.drdobbs.com/the-double-metaphone-search-algorithm/184401251?pgno=2. Online; retrieved 01.11.2019.

[Po69] Postel, Hans Joachim: Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. In: IBM-Nachrichten, 19. Jahrgang. pp. 925–931, 1969.

[SPS15] Schürer, K.; Penkova, T.; Shi, Y.: Standardising and coding birthplace strings and occupational titles in the British censuses of 1851 to 1911. Historical Methods, pp. 195–213, 2015.

[Wi90] Winkler, W. E.: String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. In: Proceedings of the Section on Survey Research Methods. American Statistical Association. pp. 354–359, 1990.