tag that holds the summary of the transcription (the text under the heading “Contents” in figure 2). The summary contains one or more paragraphs, without any annotations of place names, persons etc. In order to geocode the place names mentioned in the summary – which are essentially all the names that are also present in the full text of the document – these must first be identified as place names, i.e. tagged according to the TEI format. After that they could be matched with geodata.⁹ How this could be done is discussed in the following sections.

⁹ Writing summaries of the medieval charters is an ongoing editorial activity at the Swedish National Archives. The letters from the years 1360–1380 have the most detailed summaries. Later on, more place names will be added to the summaries of the documents from other periods.

3 Geocoding historical places

A place name points to a geographical location. In medieval charters, most place names refer to settlement units such as towns, manors, villages, hamlets and single farms, henceforth referred to as historical settlement units.¹⁰

There are several tools on the internet that could be used to mark a location on a modern map, as a point or area, and retrieve this data in a format such as TEI or GeoJSON. It would thus be possible to add coordinates directly to the digital edition of a particular source material. However, a better approach would be to match place names with existing geodata, provided by for instance GeoNames. Initiatives have also recently been taken to create open geodata for historical sites. One such initiative is the Pelagios project, which connects several sets of geodata, for instance the Digital Atlas of the Roman Empire.¹¹

Analyses of historical documents in relation to spatial data should have great potential in several research areas. The administrative divisions of Denmark have been digitized within the DigDag project.¹² Mapping and GIS techniques are also regularly used and adapted by researchers, for example, to analyze the routes taken by individuals (Storm et al, 2017).
Spatial data from historical maps has enabled identification of old cultivated hops (Strese et al, 2009). Data extracted from historical maps is the basis for an identification of late-medieval deserted farms (Karsvall, 2016).

The project TORA – Topographical Register at the National Archives in Stockholm – specifically defines the historical places in Sweden that existed from the Middle Ages until about the 18th century.¹³ The aim is to link digitized historical sources to well-defined geodata. TORA includes place names and coordinates – in the form of points – for all settlement units that appear in the oldest land survey maps from the 17th century (Höglund, 2008; Tollin and Karsvall, 2010). Moreover, a large number of settlements that appear in the Crown’s cadastres (jordeböcker) in the mid-16th century are also included.¹⁴ At present TORA holds about 22,000 spatial coordinates, which are published as linked data in RDF format.¹⁵,¹⁶

The coordinates in TORA are set at the actual location of the settlement, according to the oldest map available for each site. The accuracy is specified as high, medium or low. A large number of abandoned settlements, not existing today, also appear in TORA, which is thus the most relevant authority file that can be linked to the place names in the medieval charters.¹⁷

¹⁰ A historical settlement unit refers to a place mentioned in a historical source, here the medieval charters.
¹¹ http://commons.pelagios.org; http://dare.ht.lu.se
¹² http://www.digdag.dk/
¹³ http://riksarkivet.se/tora
¹⁴ Since the 1960s, the project Medieval Sweden (Det medeltida Sverige) has been extracting information from the medieval documents, as well as the economic provincial material of the 16th century (landskapshandlingarna). TORA includes coordinates of all settlements published in the book series of Medieval Sweden. So far 23 book volumes have been published, covering most parts of south-east Sweden: https://riksarkivet.se/medieval-sweden-dms
¹⁵ Resource Description Framework: https://www.w3.org/RDF/
¹⁶ The historical settlement units are published in RDF format (a W3C standard) using serial numbers, e.g., data.riksarkivet.se/tora/1. TORA uses the EntryStore platform developed by MetaSolutions AB (http://entrystore.org/). For further information see http://riksarkivet.se/tora.
¹⁷ The method of how to set coordinates for settlement units using historical maps is the subject of a forthcoming article by the first author.

3.1 An example

This relatively typical case refers to a medieval charter from 1366, issued in Eskilstuna in south-eastern Sweden (SDHK 8953; see figure 2). The original letter, written in Latin, has been fully transcribed and interpreted.¹⁸ The place of issue, Eskilstuna, which already has a tag, can be assumed to correspond to the town of Eskilstuna. However, the town is of much later date (17th century). Most likely this charter was issued at the monastery in Eskilstuna, which was established in the 12th century. The monastery was abandoned in the 16th century and replaced by a castle on the same spot. The corresponding settlement unit and coordinates for Eskilstuna mentioned in this charter would therefore most likely be ‘Eskilstuna Slott’, as defined in TORA using the oldest maps from the year 1647.¹⁹

Eskilstuna is mentioned once more in the tag in the digital edition in SDHK, as ‘Eskilstuna kloster’, which refers to the monastery. What would complicate an automatic match is the fact that ‘Eskilstuna kloster’ in more recent periods has become the name of the parish in this area. In GeoNames, for example, the coordinates linked to the name ‘Eskilstuna kloster’ point to the parish area and not the location of the monastery.²⁰

The same charter, SDHK 8953, holds one more place name, Torshälla, just north of Eskilstuna. TORA has two registrations of Torshälla: one that refers to the town, which dates from the early 1300s, and one that refers to the neighboring settlement of a vicarage. It is not clear which of the two is intended. In such cases it could be useful to make a temporary match that includes the other options available as alternatives.²¹

¹⁸ https://sok.riksarkivet.se/bildvisning/Sdhk_8953.jpg
¹⁹ https://data.riksarkivet.se/tora/2675; https://sok.riksarkivet.se/bildvisning/R0000151_00049
²⁰ http://www.geonames.org/8132118/eskilstuna-kloster.html
²¹ https://data.riksarkivet.se/tora/12463; https://data.riksarkivet.se/tora/2733

4 Towards automatic name tagging in SDHK

As mentioned above, every document in the SDHK database has a human-composed summary written in modern Swedish (labeled “Contents” in figure 2), although the language of the summaries sometimes has an archaic feel to it, especially when the original letter is in Old Swedish (and not in Latin as in the example in figure 2).

These summaries form the point of departure for the experiment described in this paper. Employing state-of-the-art natural language processing and information extraction techniques, the summaries can be used as a rich source of metadata for the underlying documents, complementing the manually compiled metadata already accompanying them. At least the kinds of (legal) action, the involved individuals and other agents, and the involved locations should be retrievable from the summaries with automatic methods.

Given the aim expressed above of linking SDHK to the historical place name register TORA, in our first experiment described here, we apply a named-entity recognition (NER) system designed for modern Swedish to the summaries and evaluate its accuracy on data which – although written in the modern language – may contain many names which are no longer current.

4.1 Experimental setup

For this experiment, 14 documents were sampled from the SDHK database and their summaries extracted, amounting to a total of approximately 1,500 words.²² All names of persons and places were manually identified and annotated in the text of the summaries, in order to have a gold standard dataset for evaluating the automated NER accuracy.²³

For the NER we use a rule-based system developed over many years in Språkbanken at the University of Gothenburg (Kokkinakis et al, 2014). It has already proven its worth in some digital humanities applications, although with textual data of more recent date than in the present case (Borin and Kokkinakis, 2010; Borin et al, 2014). NER is one of the annotations available through Språkbanken’s Sparv infrastructure (Borin et al, 2016; see figure 3). Sparv exposes the linguistic annotation pipeline used as the first step in Språkbanken’s Korp corpus infrastructure (Borin et al, 2012).

Sparv also makes available a corpus upload function for offline processing of larger volumes of text.
In this way the 14 SDHK summaries were linguistically annotated.

²² The sample comprised documents SDHK 2382, 8846, 8847, 8861, 8887, 8896, 8901, 8913, 8931, 8953, 8954, 8955, 8959, 8970.
²³ Using gold standard data for evaluating an automatic process is considered absolutely de rigueur in natural language processing, as the only way of obtaining an objective and reproducible measure of the accuracy of the result of the process. In the case at hand, this means that the automatic NER is applied to the same documents that were manually annotated but with the manual annotations hidden from the automatic process, and accuracy is then computed from a comparison of the NER annotations with the manual annotations.

Fig. 3. Språkbanken’s Sparv corpus annotation interface

4.2 Experimental results and discussion

Table 1 shows the accuracy figures for the two types of names.²⁴ We note that the recall for personal names is much better than that for place names, i.e., most personal names in the texts are in fact recovered by the system.

²⁴ The figures have been calculated so that partial matches and full matches are counted as equal, i.e., a NER place name match for Nöttja is considered a full match for the manually identified Nöttja socken ‘Nöttja Parish’. The measures in the last two columns of the table are the standard information retrieval evaluation measures of recall and precision. Recall is defined as the share of all true instances in the dataset that were actually retrieved, and precision as the share of all the retrieved instances that are true instances. Thus if a dataset contains 50 true instances and a retrieval method returns 40 hits, out of which 35 are true instances, the recall in this case will be 0.70 (35/50), whereas the precision will be 0.88 (35/40).
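The recall and precision measures defined in footnote 24, together with its rule that partial and full matches count as equal, can be illustrated with a minimal Python sketch. The function names and the exact matching rule (a hit counts if it equals the gold name or is one of its whitespace-separated words) are assumptions made for illustration. This is not the evaluation script used in the experiment.

```python
def precision_recall(true_total, retrieved, retrieved_true):
    """Return (precision, recall) from raw counts.

    true_total     -- number of true instances in the gold standard
    retrieved      -- number of hits returned by the system
    retrieved_true -- number of hits that are true instances
    """
    precision = retrieved_true / retrieved if retrieved else 0.0
    recall = retrieved_true / true_total if true_total else 0.0
    return precision, recall


def is_match(predicted, gold):
    """Relaxed match (assumed rule): a full match, or the predicted name
    is one whole word of a multi-word gold name."""
    return predicted == gold or predicted in gold.split()


# Worked example from footnote 24: 50 true instances, 40 hits, 35 of them true.
p, r = precision_recall(true_total=50, retrieved=40, retrieved_true=35)
print(p, r)

# Partial match from footnote 24: 'Nöttja' against 'Nöttja socken'.
print(is_match("Nöttja", "Nöttja socken"))
```

With the counts from the footnote, the sketch yields a precision of 0.875 (reported as 0.88) and a recall of 0.70. Matching on whole words rather than substrings is a deliberate choice here, so that for example ‘Lund’ would not be counted as a match for ‘Lundby’.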
Since the NER system relies primarily on gazetteers (name lists) for the recognition of both, a possible reason for this could be that many of the personal names used in the Middle Ages are still around, while place names have seen a higher replacement rate over the intervening centuries.

Table 1. NER accuracy figures

SDHK NER
name type        total   correct   false   missed   precision   recall
Place name         114        62       4       48        0.94     0.54
Personal name      100        92       8        8        0.92     0.92

The precision of the NER system used in this experiment is comfortably high for both name types, i.e., there are few false positives in both place and personal name recognition. This means that even out of the box, this NER system could provide considerable added value to the digital resources of the National Archives as research data sources.

Finally, we note that the Sparv annotation pipeline is made up of independent tools of different origin. This means that the tools do not normally draw on each other’s annotations, something which is clearly reflected in the results of this experiment, if we take all annotations into account. Consequently, the named entity recognizer does not utilize the information from the part-of-speech tagger, which in fact tags 42 additional place names and 4 additional personal names as proper nouns (tag PM), but also gives 15 false positives, i.e., non-names tagged as proper nouns. Measured against the names missed by the NER system (48 place names and 8 personal names, see Table 1), this gives a recall rate (for names) of 0.82 (46/56) and a precision of 0.75 (46/61), with recall rates for place names and personal names of 0.88 (42/48) and 0.50 (4/8), respectively. Distributing the false positives proportionally between the two classes of names, we end up with estimated precision figures for place names and personal names of 0.75 and 0.80, respectively.

The analysis, using NER, has so far targeted the occurrences of the names. However, a place name may refer to several geographical locations. Common names of settlements, which are found in our survey material, are e.g. ‘Berga’, ‘Lundby’ and ‘Söderby’.
A further geographical identification of the sampled data – the 14 medieval charters with altogether 114 place names – is therefore needed. For this purpose TORA will be used. The geographic accuracy of the NER system and the National Land Survey database (Lantmäteriet, LM) will also be assessed.

The result, presented in Table 2, is based on a semi-automatic method with two steps: firstly, the place names in the XML documents of the medieval letters have been matched against the name lists discussed here (TORA, NER and LM); secondly, a manual check has been conducted to evaluate whether the matched names refer to the geographical locations mentioned in the documents. It should be emphasized that medieval charters are difficult to interpret and that spatial localization sometimes requires expert analysis. In our sample data, however, all names (with one exception) are normalized, i.e. written in modern Swedish.

Table 2. Geocoding of place names in medieval charters

source   correct   uncertain   missed   precision   recall
NER           41          25       48        0.62     0.36
TORA          46           6       62        0.88     0.40
LM            41          24       49        0.63     0.36

As shown in Table 2, TORA, NER and LM provide comparable results, with recall rates between 0.36 and 0.40. All three recognize the names of major towns. Unique and unusual place names are also matched with their geographical location. Foreign place names (25 in our sample data) could not be recognized by TORA or LM, as they are limited to the current borders of Sweden.

Overall, TORA is the better recognizer, especially since it has the most accurate coordinates of historical places, as discussed before (see section 3). It does not cover general names of countries (such as ‘Sweden’) and provinces (such as ‘Uppland’). On the other hand, it includes all parishes and also other historical administrative divisions such as hundreds (härader). The strength of TORA is above all that it recognizes most settlement units, i.e. villages and hamlets.
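The two-step method described above (automatic matching against a name list, followed by a manual check) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually used. The dictionary layout, the status labels and the function names are hypothetical. The TORA URIs are those cited in footnotes 19 and 21, and ‘Paris’ is a made-up example of a foreign name. The scoring formula is an assumption as well: precision as correct / (correct + uncertain) and recall as correct / all place names. It was chosen because it reproduces the figures in Table 2.

```python
from collections import namedtuple

# Toy name list: place name -> candidate identifiers (hypothetical layout).
# The URIs are the TORA identifiers cited in footnotes 19 and 21.
GAZETTEER = {
    "Eskilstuna Slott": ["https://data.riksarkivet.se/tora/2675"],
    "Torshälla": [
        "https://data.riksarkivet.se/tora/12463",
        "https://data.riksarkivet.se/tora/2733",
    ],
}

Match = namedtuple("Match", "name status candidates")


def lookup(name, gazetteer=GAZETTEER):
    """Step 1: automatic lookup by name. Step 2, the manual check of
    whether a candidate is the intended location, is left to an editor."""
    candidates = gazetteer.get(name, [])
    if not candidates:
        status = "missed"
    elif len(candidates) == 1:
        status = "unique"
    else:
        # Keep every option as an alternative for the manual check.
        status = "ambiguous"
    return Match(name, status, candidates)


def geocoding_scores(correct, uncertain, missed):
    """Precision and recall under the assumed Table 2 formula."""
    precision = correct / (correct + uncertain)
    recall = correct / (correct + uncertain + missed)
    return round(precision, 2), round(recall, 2)


for name in ["Eskilstuna Slott", "Torshälla", "Paris"]:  # 'Paris' is made up
    print(lookup(name))

# Rows of Table 2: (correct, uncertain, missed)
for source, counts in {"NER": (41, 25, 48), "TORA": (46, 6, 62), "LM": (41, 24, 49)}.items():
    print(source, geocoding_scores(*counts))
```

Keeping all candidates for an ambiguous name, rather than picking one, mirrors the suggestion in section 3.1 to make a temporary match that retains the alternatives.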
Only six names in this experiment are uncertain, giving a precision of 0.88. In comparison, NER and LM show a geographical precision of 0.62 and 0.63. As shown in Table 1, NER successfully finds 62 place names. Of these, 41 names can be assumed to point out the correct location. The uncertainty is mainly due to the occurrence of common place names such as ‘Stenby’, ‘Lundby’ and ‘Valla’, among others. With TORA – which includes administrative divisions, such as parishes – it is easier to determine the geographical location of a place name, even if it occurs frequently.

The total number of place names and coordinates in TORA is at present limited. About 50 percent of all historical settlement units are included so far. Names not yet included can, however, be added. As long as TORA is not complete, NER and LM will be necessary complements. LM holds all official place names in Sweden today, but it does not cover historical settlements, e.g. obsolete place names and abandoned settlements.

In conclusion, the results of this small experiment are promising, and there is also clearly room for improvement of the NER system. In order to make this technology useful in the present context, the next natural step would be to match recognized names to authority lists such as TORA that give the names an identity and make them suitable as linked data. This would serve to add meaningful identities to place names in the medieval documents, by providing accurate coordinates that point to settlement locations specified in the oldest large-scale maps.

5 Conclusions

Since the name information in the Swedish charters today lacks clear identifiers and spatial coordinates, it is difficult to make statistical calculations or any kind of spatial comparative study. One solution to these problems would be to relate historical data of this kind to actual physical locations by using coordinates. To a large extent, matching can be done automatically based on name.
We investigated the feasibility of applying a named entity recognition system to the text of the SDHK document summaries, with quite promising results and some indications of how the NER system could be improved. We intend to pursue this work further, including how to utilize the added metadata in user interfaces for researchers. This kind of ‘coordinate-based’ linked database would present quite new conditions and possibilities within a number of disciplines, e.g. history, historical geography, social history, philology, onomastics and archaeology. Irrespective of administrative divisions, data can be collected freely, and previously unknown connections may appear. Large sets of data can be extracted for national surveys illustrating, for instance, regional differences. Scholars and users can devote their time to formulating research questions and making analyses instead of collecting and preparing the research material.

Acknowledgements

The research presented here is part of a project (TORA) funded by the Royal Swedish Academy of Letters, History and Antiquities, Riksbankens Jubileumsfond, and the Swedish National Archives. The work has also drawn on the e-infrastructure Swe-Clarin, funded by the Swedish Research Council (contract 2013-2003) and by the collaborating partner institutions forming Swe-Clarin (University of Gothenburg, The Institute of Language and Folklore, KTH, Linköping University, Lund University, The National Archives, Stockholm University and Uppsala University).

References

Borin L, Kokkinakis D (2010) Literary onomastics and language technology. In: van Peer W, Zyngier S, Viana V (eds) Literary education and digital learning. Methods and technologies for humanities studies, Hershey, New York, pp 53–78.
Borin L, Forsberg M, Roxendal J (2012) Korp – the corpus infrastructure of Språkbanken. In: Proceedings of LREC 2012, ELRA, Istanbul, pp 474–478.
Borin L, Dannélls D, Olsson LJ (2014) Geographic visualization of place names in Swedish literary texts. Literary and Linguistic Computing 29(3):400–404.
Borin L, Forsberg M, Hammarstedt M, Rosén D, Schäfer R, Schumacher A (2016) Sparv: Språkbanken’s corpus annotation pipeline infrastructure. In: SLTC 2016. The Sixth Swedish Language Technology Conference, Umeå University, Umeå.
Höglund M (ed) (2008) 1600-talets jordbrukslandskap: En introduktion till de äldre geometriska kartorna. Riksarkivet, Stockholm.
Jockers ML (2013) Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Urbana/Chicago/Springfield.
Karsvall O (2016) Utjordar och ödegårdar. En studie i retrogressiv metod. SLU, Uppsala.
Keim DA, Kohlhammer J, Ellis G, Mansmann F (2010) Mastering the information age – Solving problems with visual analytics. Eurographics, Goslar.
Kokkinakis D, Niemi J, Hardwick S, Lindén K, Borin L (2014) HFST-SweNER: A new NER resource for Swedish. In: Proceedings of LREC 2014, ELRA, Reykjavik, pp 2537–2543.
Moretti F (2005) Graphs, Maps, Trees: Abstract Models for a Literary History. Verso, London/New York.
Moretti F (2013) Distant reading. Verso, London/New York.
Schöch C (2013) Big? Smart? Clean? Messy? Data in the humanities. Journal of Digital Humanities 2(3).
Storm I, Nicol H, Broughton G, Tangherlini TR (2017) Folklore tracks: Historical GIS and folklore collection in 19th century Denmark. In: Golub K, Milrad M (eds) Digital Humanities 2016, CEUR-WS.org, Aachen, pp 75–98.
Strese EM, Karsvall O, Tollin C (2009) Inventory methods for finding historically cultivated hop in Sweden (Humulus lupulus L.). Genetic Resources and Crop Evolution 57(2):219–227.
Tollin C, Karsvall O (2010) Sveriges äldre geometriska kartor. Bebyggelsehistorisk tidskrift 60:94–103.