<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Past is a Foreign Place: Improving Toponym Linking for Historical Newspapers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariona Coll Ardanuy</string-name>
          <email>mcoll@prhlt.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Nanni</string-name>
          <email>fnanni@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kaspar Beelen</string-name>
          <email>kaspar.beelen@sas.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luke Hare</string-name>
          <email>lhare@turing.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Digital Humanities Research Hub, School of Advanced Study</institution>
          ,
          <addr-line>Senate House, London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politècnica de València</institution>
          ,
          <addr-line>València</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Alan Turing Institute</institution>
          ,
          <addr-line>British Library, London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Work conducted while at The Alan Turing Institute</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>368</fpage>
      <lpage>390</lpage>
      <abstract>
        <p>In this paper, we examine the application of toponym linking to digitised historical newspapers. These collections constitute the largest trove of historical text data available to researchers in the humanities. They contain varied, fine-grained information about the past, anchored in a specific place and time. Place names (or toponyms) are common entry points for starting to explore these collections. In this paper, we introduce a new tool for toponym linking and resolution, T-Res, a modular, flexible, and open-source pipeline, which is built on top of robust state-of-the-art approaches. We present a comprehensive step-by-step examination of this task in English, and conclude with a case study in which we show how toponym linking enables historical research in the digitised press.</p>
      </abstract>
      <kwd-group>
        <kwd>toponym resolution</kwd>
        <kwd>entity linking</kwd>
        <kwd>historical newspapers</kwd>
        <kwd>nineteenth-century</kwd>
        <kwd>toponym linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        newspaper texts. In this paper, we focus on the specific task of identifying and linking
geographical named entities (i.e. toponyms) in historical newspapers in English. This task presents
certain additional challenges compared to standard EL, as illustrated by the following news fragments:
• CxitiTCHUßCit, June 10—Yesterday being the day appointed for the election of taro gentlemen to tepcoeot this borough in the new imperial Parliament [...].1
• Leghorn, April 6. ETTERS from Condantinopte, dated March 3, mention, tliat an Earthquake had lately hapL 3 pened at Tauris, the Capita! of the Province of *IS )ra Ariherbigan, in Pcrfia. 2
Most visible are the errors introduced during digitisation and optical character recognition
(OCR). Such errors can occur both in the named entity itself (sometimes even rendering it
incomprehensible to the human reader) and in the context of the entity. Secondly, historical
newspapers portray a world that has changed, while at the same time often being very regional in
their focus [
        <xref ref-type="bibr" rid="ref23">20</xref>
        ]: in the first example, ‘CxitiTCHUßCit’ (i.e. Christchurch) refers to the town in
Dorset, which would have been the first referent for readers of The Dorset County
Chronicle, rather than the (today) better-known city in New Zealand. Despite their strong regional
focus, most publications also covered international news, reflecting a state (and vision) of the
world that has changed: notice the use of the toponyms ‘Leghorn’ for Livorno, ‘Condantinopte’
(i.e. Constantinople) for Istanbul, and ‘Tauris’ for Tabriz, capital of ‘*IS )ra Ariherbigan’ (i.e. East
Azerbaijan) in ‘Pcrfia’ (i.e. Persia, modern Iran).
      </p>
      <p>
        In this paper, we perform a comprehensive step-by-step examination of toponym linking
in the historical newspapers domain in English. As a result of this analysis, we
present T-Res,3 a new tool for toponym linking and resolution in historical newspapers in English, built
on top of existing robust technologies, such as transformers [
        <xref ref-type="bibr" rid="ref50">49</xref>
        ] for fine-tuning a BERT
language model for named entity recognition [1
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]; DeezyMatch [
        <xref ref-type="bibr" rid="ref26">23</xref>
        ] for candidate selection;
and the work of Le and Titov [
        <xref ref-type="bibr" rid="ref29">27</xref>
        ] and Ganea and Hofmann [17] for entity disambiguation,
via the Radboud Entity Linker (REL) implementation [2
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]. T-Res has been developed to assist
researchers in exploring large collections of digitised historical newspapers, and has been designed
to tackle common problems in working with these data. It is implemented as a modular pipeline,
and is both user-friendly and flexible: users can either provide their own resources
and datasets and train their own models, or load existing models. We conclude our
paper with a preliminary but realistic case study in which we showcase how T-Res can be used
to support historical research.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Entity linking (EL) is often treated as a three-step process: (1) named entity recognition (NER)
is the task of detecting mentions, (2) candidate selection (CS) is the task of selecting a subset
of potential referents from a knowledge base (KB) for the detected mentions, and (3) entity
disambiguation (ED) finds the best match, if any, from the pool of selected candidates. EL
benchmarks in English consist mainly of texts from the general domain, which mostly feature
prominent entities [48].
1The Dorset County Chronicle, 1864-04-14.
2The Manchester Mercury, 1780-05-30.
3https://github.com/Living-with-machines/T-Res.
Therefore, tools that perform well on such datasets are often found to
deteriorate in other domains, such as on historical documents [3
        <xref ref-type="bibr" rid="ref34 ref38 ref9">8, 42, 36, 32</xref>
        ]. The HIPE-2020
shared task4 [14] was created to address some of the EL challenges that are specific to digitised
historical documents.
      </p>
      <p>
        Historical digitised data has certain traits that are typically absent from standard EL
benchmarks [13]. The presence of OCR errors is a persistent problem. In their assessment of the
impact of OCR in downstream tasks, Strien, Beelen, Coll Ardanuy, Hosseini, McGillivray, and
Colavizza [46] and Hamdi, Jean-Caurant, Sidère, Coustaty, and Doucet 1[
        <xref ref-type="bibr" rid="ref13">9</xref>
        ] observe how NER
performance decreases as text quality declines. The results of the HIPE-2020 shared task [1
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]
(and its continuation HIPE-2022 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) point to the importance of having in-domain training
data for NER, suggesting that fine-tuning on noisy data results in better performance on
similarly noisy data. Similarly, Manjavacas and Fonteyn [
        <xref ref-type="bibr" rid="ref33">31</xref>
        ] show how NER models perform better
when they have been fine-tuned on top of base models which were originally pre-trained on
in-domain data, in this case, historical digitised texts. González-Gallardo, Boros, Girdhar, Hamdi,
Moreno, and Doucet [
        <xref ref-type="bibr" rid="ref21">18</xref>
        ] evaluated the performance of OpenAI’s ChatGPT on the task of
detecting (in a zero-shot manner) named entities in historical documents, revealing that
ChatGPT similarly struggles with identifying entities in OCR’d text.5
      </p>
      <p>
        Candidate selection is the least studied of the three sub-tasks. The identification of potential
candidates from the KB (usually based on collaboratively-built resources such as Wikipedia,
Wikidata, or Freebase) has traditionally been approached by performing exact or partial string
matching between a mention and the entries in the KB [
        <xref ref-type="bibr" rid="ref35 ref47">33, 45</xref>
        ]. Since most popular EL
benchmarks consist of very clean text, this step does not often pose an obstacle to good
EL performance. In other words, plain string matching goes a long way. However, when
working with noisy text, basic string matching is far from sufficient [
        <xref ref-type="bibr" rid="ref52">51</xref>
        ]. In the domain of digitised
historical newspapers, Linhares Pontes, Cabrera-Diego, Moreno, Boros, Hamdi, Doucet, Sidere,
and Coustaty [
        <xref ref-type="bibr" rid="ref31">29</xref>
        ] propose a series of pre-processing heuristics used in combination with a
post-correction step, based on mappings of common OCR errors observed in the data. Traditional
fuzzy string matching techniques based on edit distance (such as Levenshtein) can deal quite
accurately with OCR’d text, but they are not a viable solution for real-time EL, since they are
computationally inefficient [
        <xref ref-type="bibr" rid="ref20">10</xref>
        ]. DeezyMatch [
        <xref ref-type="bibr" rid="ref26">23</xref>
        ], a software library for neural fuzzy string
matching, was developed as a response to this problem, building on Santos, Murrieta-Flores,
Calado, and Martins [
        <xref ref-type="bibr" rid="ref45">43</xref>
        ].
      </p>
      <p>
        The last step of the pipeline is a disambiguation task, consisting of selecting the most
appropriate entity from the pool of previously selected potential candidates. The entity
disambiguation literature often distinguishes between local models, which rely only on the mention’s
context and the entities’ priors, often based on hyperlink counts from large resources such as
Wikipedia [
        <xref ref-type="bibr" rid="ref36 ref37 ref9">8, 35, 34</xref>
        ], and global models, which take interdependencies between entities into
account [
        <xref ref-type="bibr" rid="ref28 ref29 ref43">41, 25, 27</xref>
        ], with the more recent approaches learning deep representations for
relations between entities and mentions. In the domain of historical newspapers, Boros, Pontes,
Cabrera-Diego, Hamdi, Moreno, Sidère, and Doucet [6] and Linhares Pontes, Hamdi, Sidere,
and Doucet [
        <xref ref-type="bibr" rid="ref32">30</xref>
        ] build on these approaches, and emphasise the importance of good knowledge
representation.
4https://impresso.github.io/CLEF-HIPE-2020/.
5Our own experiments with ChatGPT were not more successful.
      </p>
      <p>
        EL pipelines encapsulate all steps in one toolkit. DBpedia Spotlight [33] and TagMe! [
        <xref ref-type="bibr" rid="ref18 ref42">16, 40</xref>
        ]
are two of the first and most widely used out-of-the-box linkers. More recently, REL [
        <xref ref-type="bibr" rid="ref27">24</xref>
        ] was
developed to overcome some of the shortcomings of previous systems, building on
state-of-the-art approaches. REL uses Flair [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for recognition. For candidate selection, just like most other
state-of-the-art approaches, REL employs a series of string-based heuristics to find potential
candidates, which are ranked according to a combination of entity priors and a measure of
similarity between the entity and the context of the mention, as in Ganea and Hofmann [1
        <xref ref-type="bibr" rid="ref8">7</xref>
        ], using
Wikipedia2Vec’s [
        <xref ref-type="bibr" rid="ref51">50</xref>
        ] word and entity embeddings. The local coherence between mention and
entity is computed as defined in Ganea and Hofmann [17], and REL uses the global disambiguation
strategy proposed in Le and Titov [2
        <xref ref-type="bibr" rid="ref8">7</xref>
        ]. REL is fast, user-friendly, easily customisable, and very
well documented, including tutorials, examples, and a running API.6
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Terminology and Task Definition</title>
      <p>In this paper, we use the terms toponym linking and (geographic) entity linking interchangeably
to refer to the end-to-end task of detecting mentions of places in texts and linking them to
their referent in a knowledge base.7 Formally defined, given a document d, the goal is to
detect mentions of places m1, ..., mn and resolve them to their corresponding entities e1, ..., en in
a knowledge base (KB). This is achieved in three steps. The first step, called toponym recognition
or (geographic) named entity recognition, consists of detecting mentions of places m1, ..., mn in
a document d. The second step is candidate selection, which, for a given mention mi, aims
at selecting a subset of potential candidate entities Ci = (c1, ..., ck) from the KB. The last step is entity
disambiguation, which, given the set of candidates Ci for mention mi, consists of selecting the
candidate that is the correct entity ei for mention mi, or returning NIL if there is none. Finally, we
define toponym resolution as the task of retrieving the coordinates of the predicted entities.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Knowledge Base</title>
        <p>
          As is commonly done in entity linking, we have used Wikimedia resources (in this case,
Wikipedia in combination with Wikidata) as the starting point for our KB. For each Wikipedia
page (hereafter ‘entity’), we extracted all ways of referring to it over the entire Wikipedia
collection by means of the anchor texts of the hyperlinks pointing to the page (hereafter ‘mentions’).
We then mapped Wikipedia entities to Wikidata, and kept only the subset of entities that are
6See: https://github.com/informagi/REL. While more recent approaches have now surpassed it on the leaderboard,
it is still positioned near the top, according to https://paperswithcode.com/task/entity-linking.
7Detecting and resolving mentions of places to their real-world referents is a research problem shared by two
different tasks: (1) Entity Linking, the task of linking named entities to their corresponding entries in a knowledge base,
and (2) Toponym Resolution, also called geoparsing, the task of resolving place names to their spatial footprint,
often their geographic coordinates [
          <xref ref-type="bibr" rid="ref30">28</xref>
          ]. Because of these slightly different objectives, both tasks are rarely
evaluated jointly. We treat this problem as an EL task, in part because linking to Wikidata (instead of directly providing
coordinates) gives the user access to other linked information.
geolocated on Wikidata.8 The resulting subset consists of 929,855 geolocated entities. In
addition, for each entity, we keep the absolute and normalised mention-to-entity frequencies of
all its mentions. Mention-to-entity frequencies are normalised per entity: for example, the
settlement named ‘London’ in Kiribati has an absolute mention-to-entity count of 13 and a
normalised frequency of 0.81 (the mention ‘London’ refers to the location in Kiribati 13 times,
but the probability of the London in Kiribati being referred to as ‘London’ is 0.81).
        </p>
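        <p>The per-entity normalisation described above can be sketched as follows. The count of 13 for the mention ‘London’ is taken from the example in the text; the second mention and its count are invented so that the arithmetic lands near the reported 0.81.</p>
        <preformat>
```python
# Sketch of per-entity normalisation of mention-to-entity counts:
# P(mention | entity) = count(mention, entity) / total counts for that entity.

def normalise(counts):
    """counts: {entity: {mention: absolute count}} -> normalised frequencies."""
    freqs = {}
    for entity, mentions in counts.items():
        total = sum(mentions.values())
        freqs[entity] = {m: c / total for m, c in mentions.items()}
    return freqs

# The 13 comes from the paper's example; the second mention and its count
# of 3 are invented, giving 13 / 16 = 0.8125, close to the reported 0.81.
counts = {"London (Kiribati)": {"London": 13, "Londres": 3}}
freqs = normalise(counts)
```
        </preformat>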
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets</title>
        <p>
          We performed experiments on two different digitised historical newspaper datasets in English:
• TopRes19th (henceforth lwm): This dataset was created by the Living with Machines
project [
          <xref ref-type="bibr" rid="ref13">9</xref>
          ].9 In its latest version (v2), this dataset consists of 455 news articles in which
places were manually annotated and linked to Wikipedia (which we have mapped to
Wikidata). The news articles in this dataset were selected from local or regional
newspapers based in different locations in England (Manchester, Ashton-under-Lyne, Poole and
Dorchester), published between 1780 and 1870. In the dataset, toponyms are classified as
‘BUILDING’, ‘STREET’, ‘LOC’, ‘ALIEN’, ‘FICTION’, and ‘OTHER’, but the last three were
found to occur between zero and five times in the whole dataset, rendering them
negligible for training purposes. The dataset is split into training and test sets (343 and 112
articles respectively). We used 20% of the training set for development.
• Hipe2020 (henceforth hipe): This dataset was created by the Impresso project with data
from the Chronicling America project, and was released as part of the HIPE-2020
evaluation campaign on named entity processing on historical newspapers [1
          <xref ref-type="bibr" rid="ref5">4</xref>
          ].10 It consists
of news articles in English, French, and German. The English collection, which is the
one we use, consists of 125 articles from 14 different newspapers (based in 14 different
locations in the United States) published between 1790 and 1960. The named entities
are manually identified and linked, whenever possible, to their corresponding Wikidata
entity. While the dataset contains other entity types (such as ‘person’ or ‘organisation’), in
our experiments we consider only entities of the type ‘location’. This dataset does not
have a training set; it is instead split into a development and a test set (80 and 46 articles
respectively).
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Approaches</title>
        <sec id="sec-4-3-1">
          <title>4.3.1. Named entity recognition</title>
          <p>
            We fine-tuned a BERT model for token classification, using the lwm training set. We used
a historical BERT model as base, bert_1760_1900 [
            <xref ref-type="bibr" rid="ref25">22</xref>
            ], trained on books in English published
between 1760 and 1900.11 To fine-tune for toponym recognition, we used a learning rate of
0.00005, a batch size of 8, 10 epochs, and a weight decay of 0.001.12 We perform a series of
post-processing steps to fix obvious mistagging errors: we corrected I- labels at the beginning of a
new entity, removed inner I- tags due to nested entities, and fixed prefix assignment errors in
hyphenated entities.
8We used the 2021-10-20 English Wikipedia version and the 2022-07-28 Wikidata version. We relied on the
WikiExtractor (https://github.com/attardi/wikiextractor) tool to extract the content of each page from the
Wikipedia XML dump, used WikiMapper (https://github.com/jcklie/wikimapper) to map Wikipedia titles to
Wikidata QIDs, and used the Wikidata property P625 to filter out non-geographic entities.
9The lwm dataset is available at https://doi.org/10.23636/r7d4-kw08 (CC BY-NC-SA 4.0). Newspaper data was
provided by Findmypast Limited from the British Newspaper Archive, a partnership between the British Library
and Findmypast: https://www.britishnewspaperarchive.co.uk/.
10The hipe dataset is available at https://zenodo.org/record/6046853 (CC BY-NC-SA 4.0).
          </p>
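          <p>The first of these post-processing fixes, relabelling an I- tag that opens a new entity, can be sketched as follows. This is a minimal illustration of standard BIO repair; the exact rules applied in T-Res may differ.</p>
          <preformat>
```python
# Minimal sketch of one post-processing fix: an I- tag that opens a new
# entity (no preceding B-/I- of the same type) is relabelled as B-.

def fix_leading_inside_tags(tags):
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            ttype = tag[2:]
            # the previous token must continue an entity of the same type
            if prev not in ("B-" + ttype, "I-" + ttype):
                tag = "B-" + ttype
        fixed.append(tag)
        prev = tag
    return fixed

tags = ["O", "I-LOC", "I-LOC", "O", "I-BUILDING"]
print(fix_leading_inside_tags(tags))
# ['O', 'B-LOC', 'I-LOC', 'O', 'B-BUILDING']
```
          </preformat>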
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Candidate selection</title>
          <p>
            Our tool provides two main strategies for candidate selection:13 one is based on exact
matching, where candidates are retrieved from the KB if they are identical to the query; the
other is based on fuzzy string matching, using a deep learning approach, DeezyMatch [
            <xref ref-type="bibr" rid="ref20 ref26">23, 10</xref>
            ] in a new fashion, which we expand on in the following
paragraphs.
          </p>
          <p>DeezyMatch for candidate selection DeezyMatch learns string transformations from a
large set of positive and negative example pairs (e.g. both ‘Zuiich’ and ‘7urich’ are positive
examples of OCR variations of ‘Zurich’, whereas ‘Munich’ is not). A model trained from these
examples is then used to embed both (1) the query and (2) all name variations in the KB into
vector representations. Candidate string variations are retrieved from the KB and ranked
according to the similarity between their embedding representations and the query embedding.</p>
          <p>
            We propose a new approach for generating positive and negative example pairs when large
volumes of noisy text are available. We observed an interesting difference between static word
embeddings learnt from clean text and those learnt from OCR’d text. In the first case, as explained
by the distributional hypothesis, the top nearest neighbours of a query tend to be words that
are semantically similar. However, when word embeddings are trained on OCR’d text, many of
the top nearest neighbours are OCR variations of the query. We used this observation to build
a dataset of positive and negative matches from word2vec embeddings learnt from digitised
English newspapers, where:
• If the string similarity between the nearest neighbour and the target word is high (such
as ‘maciiine’ and ‘machine’) and the nearest neighbour is not an existing word in English,
we consider it an OCR string variation of the target word (i.e. a positive example);
• If the string similarity between the nearest neighbour and the target word is low (such
as ‘maciiine’ and ‘device’), we consider that it is not a string variation, as it is probably a
synonym or near-synonym (i.e. a negative example).
11The model is available at https://huggingface.co/Livingwithmachines/bert_1760_1900.
12We selected these values based on previous research that performed a hyperparameter search for the same task
and a different base model [
            <xref ref-type="bibr" rid="ref46">44</xref>
            ]. See more information at https://github.com/dbmdz/clef-hipe/blob/main/experiments/clef-hipe-2022/. The resulting toponym recognition model is available at https://huggingface.co/Livingwithmachines/toponym-19thC-en.
13We also provide functions for performing partial matching based on string overlap and fuzzy string matching
based on the Damerau-Levenshtein edit distance. However, both methods are highly time-consuming, and
therefore unusable for real-time scenarios.
We used openly-available word embeddings trained on digitised newspaper text from four
different decades (1800s, 1830s, 1860s, and 1890s).14 Table 1 shows examples of positive and
negative string matches generated with this approach. We expanded the resulting string-pair
dataset by appending similar variations of place names obtained from our KB mention-to-entity
mapping.15 The resulting dataset consists of 1,085,514 string pairs.
Candidate ranking and selection Given a query, the candidate selection step retrieves one
or more potential name variations from the KB. In the exact match approach, only one name
variation is retrieved (i.e. the identical match), with a similarity score of 1.0. In the DeezyMatch
approach, the user can choose the number of name variations to retrieve and set the maximum
accepted distance between the embeddings of the query and the KB mentions.16 The similarity
score for each of the retrieved name variations is obtained by reverse-normalising the
distance score against the threshold. Each name variation is then expanded to multiple Wikidata
entities (i.e. candidates), using the mention-to-entity mapping from our KB.17
          </p>
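          <p>The pair-labelling heuristic described above can be sketched as follows. Here difflib’s similarity ratio stands in for whichever string-similarity measure is actually used, and the 0.7 threshold and the three-word vocabulary are our own illustrative choices.</p>
          <preformat>
```python
from difflib import SequenceMatcher

# Sketch of the heuristic for labelling embedding nearest neighbours:
# string-similar non-words become positive (OCR-variation) examples,
# string-dissimilar neighbours become negative (synonym-like) examples.

english_vocab = {"machine", "device", "engine"}  # stands in for the GloVe vocabulary

def label_pair(target, neighbour, threshold=0.7):
    sim = SequenceMatcher(None, target, neighbour).ratio()
    if sim >= threshold and neighbour not in english_vocab:
        return "positive"  # e.g. 'maciiine' as an OCR variation of 'machine'
    if threshold > sim:
        return "negative"  # e.g. 'device', a near-synonym of 'machine'
    return None  # a string-similar existing word: ambiguous, skip
```
          </preformat>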
        </sec>
        <sec id="sec-4-3-3">
          <title>4.3.3. Entity disambiguation</title>
          <p>
            The last step consists of finding the most likely entity from the pool of selected candidates for
a given query. We provide two dummy baselines: the first baseline (mostpopular) selects the
candidate which is most likely to be referred to by a given query, using the mention-to-entity
absolute counts described in section 4.1. The second baseline (bydistance) is based on distance
from the place of publication: the candidate closest to the place of publication
is naively selected as the correct entity. Finally, our tool adopts REL’s entity disambiguation
14The word embeddings are available at https://doi.org/10.5281/zenodo.7181682 (CC BY 4.0) [39]. For example,
the nearest neighbours of ‘machine’ in word embeddings trained from digitised newspaper articles published in
the 1860s are: ‘machines’, ‘maehine’, ‘maciiine’, ‘machina’, ‘maohine’, ‘achine’, ‘miachine’, and ‘maohine’. We used
the vocabulary of the 50d GloVe embeddings to discern whether a word exists in English. Further details can be
found in our GitHub repository.
15For example, the Wikidata entry Q7268098 is referred to as both ‘Qoorlugud’ and ‘Qorilugud’: they would be added
as positive variations of each other.
16We selected one name variation per query, with an L2-norm similarity threshold set at 50.
17For example, given the query ‘Wiltshire’, the exact match approach would retrieve the mention ‘Wiltshire’ from
the KB with a similarity score of 1.0, which would be expanded to the following Wikidata entities: Q23183,
Q55448990, and Q8023421, since all of them are referred to as ‘Wiltshire’ at least once in Wikipedia anchor
texts.
implementation18 (rel), which is based on Ganea and Hofmann [17] and Le and Titov [
            <xref ref-type="bibr" rid="ref29">27</xref>
            ], and
uses a neural approach to combining local mention-to-entity compatibility and global
entity-to-entity coherence. We provide our own set of candidates (selected with either the exact or the deezy
matching approach), which we pre-rank by averaging the string matching score, the relative
mention-to-entity score, and the normalised absolute mention-to-entity score. We additionally
provide the following two alterations to the REL disambiguation approach:
• Providing information about the place of publication (+publ): Since we are aware
of the strong local emphasis of the historical press, we experiment with artificially
providing information on the place of publication (which is often available from the
newspaper’s metadata) to the disambiguation module: we do so by adding one additional
already-disambiguated entity per sentence, both in training and in testing,
corresponding to the place of publication, and by adding the publication place name as part of the
context of the sentence.
• Unlinking micro-locations (:nil): Streets and buildings have rather different
characteristics from other locations typically found in news articles: they are often highly
ambiguous and often entirely dependent on the cues provided by context. At the same
time, they have very limited coverage in Wikipedia (where only the most noteworthy
streets or buildings are usually included). In this variation of the original method, only
the LOC entities are disambiguated, whereas mentions classified as BUILDING or STREET
are linked to NIL.
          </p>
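          <p>The two baselines can be sketched as follows. The candidate fields are hypothetical; the count of 13 for the London in Kiribati comes from section 4.1, while the other count, the placeholder identifier, and the approximate coordinates are illustrative.</p>
          <preformat>
```python
from math import asin, cos, radians, sin, sqrt

# Sketch of the two dummy disambiguation baselines described above.

def most_popular(candidates):
    """mostpopular: pick the candidate with the highest absolute mention count."""
    return max(candidates, key=lambda c: c["mention_count"])

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points, in km."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def by_distance(candidates, place_of_publication):
    """bydistance: naively pick the candidate closest to the place of publication."""
    return min(candidates, key=lambda c: haversine_km(c["coords"], place_of_publication))

candidates = [
    {"qid": "Q84", "mention_count": 50000, "coords": (51.51, -0.13)},  # London, UK
    {"qid": "london-kiribati", "mention_count": 13, "coords": (1.98, -157.48)},  # placeholder id
]
dorchester = (50.71, -2.44)  # approximate place of publication
```
          </preformat>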
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation and Discussion</title>
      <p>In this section, we report and discuss the results assessed using the HIPE-scorer.19</p>
      <sec id="sec-5-1">
        <title>5.1. Toponym Recognition</title>
        <p>
          18We used the REL version at commit 9ca253b. We use the Wikipedia2Vec [
          <xref ref-type="bibr" rid="ref51">50</xref>
          ] word and entity embeddings shared
by the authors, mapping them to Wikidata entities instead of Wikipedia titles.
19The HIPE-scorer (https://github.com/hipe-eval/HIPE-scorer, v1.1) is a Python module developed as part of the
        </p>
        <p>CLEF-HIPE-2020 evaluation campaign on named entity recognition and linking on historical newspapers.
20REL’s default approach to recognising named entities in text uses Flair’s character-level sequence tagger [1],
which is trained and evaluated on CoNLL-2003 data. Since REL tags not only locations, but also persons and
organisations, we keep only those entities which are tagged as ‘LOC’ or whose prediction is an entity in our KB.</p>
        <p>
          REL returns a Wikipedia title, which we turn into a Wikidata QID.
21To learn more about the neural baseline and participating teams and approaches, read Ehrmann, Romanello,
Najem-Meyer, Doucet, and Clematide [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. We only provide results for the ‘LOC’ tag for hipe because this dataset
includes non-geographic entities. It is worth noting that there may be slight differences between the datasets
used by us and those used in the shared task, because they have undergone different preparation steps. These
differences are probably too small to be significant.
• Strict: exact boundary match, same entity type.
        </p>
        <p>• Type: at least one token overlap, same entity type.</p>
        <p>Note the considerable difference between the strict and type settings in all cases,22 the latter
reflecting the correct identification of a mention’s presence while not agreeing on the exact
named entity boundary. While the poorer performance of the out-of-the-box REL tool is not
per se surprising (given that it has not been optimised for digitised historical text), the difference
is substantial nonetheless. This experiment in fact highlights how, already at the recognition
stage, there is a difference of around 15%–23% in terms of exact F1 between the out-of-the-box
state-of-the-art method and a tool carefully attuned to the specific application domain. This is
significant, since errors introduced in this step will percolate through the rest of the pipeline.</p>
        <p>[Table 2: toponym recognition precision (P), recall, and F1 for T-Res, aauzh, neurbsl, l3i, and rel on the lwm dataset (all, loc, street, building) and the hipe dataset (loc).]</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Candidate Selection</title>
        <p>In table 3, we report the highest possible performance that can be achieved during linking (i.e.
the skyline) by the different candidate selection strategies. In other words, a skyline true positive
is when the correct entity has been selected as a potential candidate for a particular mention.
The skyline, therefore, can be considered a proxy for the quality of the different candidate
selection approaches. We provide two evaluation settings: end-to-end EL (where mentions are
identified using the best-performing NER approach), and EL-only (where gold standard
mentions are provided). In both cases, we report micro-scores using the ‘type’ evaluation setting,
as finding the exact boundaries of the named entity is not the goal here. The results show the
advantages of having a fuzzy string matching method (i.e. deezy), which in this case is trained
on corpus-specific OCR variations, similar to those present in both datasets. The lower
performance on hipe in the end-to-end EL setting is mostly a consequence of the worse
toponym recognition in the previous step.
22The smaller difference for our tool’s performance on lwm is expected, since it uses the lwm training set for NER.
        </p>
        <p>[Table 3 residue: skyline scores for the T-Res:exact and T-Res:deezy approaches on the lwm and hipe datasets.]</p>
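        <p>A minimal sketch of how such a skyline can be computed (the QIDs are illustrative; a mention counts as a skyline true positive if the gold entity appears anywhere in its candidate list, regardless of the final disambiguation):

```python
def skyline_recall(mentions):
    # mentions: list of (gold_qid, candidate_qids) pairs.
    hits = sum(1 for gold_qid, candidates in mentions if gold_qid in candidates)
    return hits / len(mentions)

sample = [
    ("Q23082", ["Q23082", "Q179815"]),  # gold entity retrieved by selection
    ("Q752266", ["Q1137286"]),          # gold entity missed: no linker can recover it
]
assert skyline_recall(sample) == 0.5
```
        </p>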
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Entity Disambiguation</title>
        <p>We report the results for the entity disambiguation step in Tables 4 and 5. The scores highlight
how a very simple baseline—the combination of perfect match (at the selection stage) and most
popular (at the disambiguation stage)—achieves a higher performance than the out-of-the-box
REL system on the lwm dataset (but not on hipe), emphasising again the importance of a
domain-specific module at the recognition stage.23 The distance-based baseline, on the other hand,
performs very poorly on both datasets. On the lwm dataset, the REL disambiguation approach
(used as part of our tool) beats the mostpopular baseline, but not on the hipe dataset, showing
that the most common sense continues to be a very strong baseline. In both cases, interestingly,
forcing streets and buildings to be ‘NIL’ results in a substantially higher performance. This
suggests that most of these entities must be of type ‘NIL’ in the data (i.e. either too ambiguous
to annotate or not present in the KB), but also that the disambiguation approach may not be
suitable for these entities. While adding the place of publication has a positive impact on the
hipe dataset, the impact on the lwm dataset is less clear.</p>
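        <p>The mostpopular baseline can be sketched as follows, assuming (as in the appendix output) a dictionary of prior scores per candidate; the QIDs and priors here are illustrative:

```python
def most_popular(prior_cand_score):
    # Pick the candidate with the highest prior; return 'NIL' when
    # candidate selection produced nothing.
    if not prior_cand_score:
        return "NIL"
    return max(prior_cand_score, key=prior_cand_score.get)

priors = {"Q179815": 0.881, "Q49229": 0.522, "Q23082": 0.313}
assert most_popular(priors) == "Q179815"
assert most_popular({}) == "NIL"
```
        </p>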
        <p>Finally, we inspected more closely how the performance of our approaches varies based on
several characteristics of our data. We split the lwm dataset into ten different subsets, each a
unique combination of the decade in which the texts were written and the place of publication
of the newspaper. We then performed a 10-fold validation of our results, where, in each fold,
one subset was used for testing, another one for development, and the remaining eight subsets
were used for training.24 Detailed results are shown in Table 6.</p>
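        <p>The fold construction can be sketched as follows (the assignment of the development subset in each fold is an assumption made for illustration; the text does not specify which subset plays that role):

```python
subsets = [
    "Ashton-under-Lyne 1860", "Dorchester 1820", "Dorchester 1830",
    "Dorchester 1860", "Manchester 1780", "Manchester 1800",
    "Manchester 1820", "Manchester 1830", "Manchester 1860", "Poole 1860",
]

folds = []
for i, test_subset in enumerate(subsets):
    dev_subset = subsets[(i + 1) % len(subsets)]  # assumed: next subset as dev
    train_subsets = [s for s in subsets if s not in (test_subset, dev_subset)]
    folds.append((test_subset, dev_subset, train_subsets))

assert len(folds) == 10
assert all(len(train) == 8 for _, _, train in folds)
```
        </p>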
        <p>First of all, we see a correlation between worse OCR quality (corresponding to the earlier data
splits) and lower skyline and linking performances. Second, there seems to be a correlation
23 Note that there are other factors that should also be taken into account. REL returns Wikipedia titles, which we
mapped to Wikidata IDs, keeping only those results that are tagged as ‘LOC’ or can be mapped to geographic
coordinates. However, it should be noted that REL uses its own KB, which consists not only of locations, making
the disambiguation a more difficult task, since it is not only geographical entities that compete in the
disambiguation process, but entities of any kind. It may be worth investigating, as part of future research, the
impact this has on end-to-end EL. In providing this comparison, our goal is to illustrate the impact of using a
general-purpose EL system for this task, and to stress the importance of developing tools that are targeted to the
specific task and domain.
24 The ten subsets by publication are: Ashton-under-Lyne 1860, Dorchester 1820, Dorchester 1830, Dorchester 1860,
Manchester 1780, Manchester 1800, Manchester 1820, Manchester 1830, Manchester 1860 and Poole 1860.
between the proportion of NILs and a lower median distance from the place of publication, which is not
entirely explained by a high presence of micro locations (i.e. streets and buildings), suggesting
either (1) a higher difficulty for human annotators in finding the true referents of local mentions,
or (2) the absence of local entities in the knowledge base. It is therefore not surprising to see
rel:nil significantly improving on mostpopular in these cases, since the former maps buildings and
streets to NIL. However, a closer inspection of the results also reveals the importance of the
sensitivity of rel:nil (and rel+publ:nil) to context: for example, while “Ashton” is consistently
resolved to Maryland by the mostpopular approach, it is in all but one case resolved
correctly to Ashton-under-Lyne by the REL-based approaches.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Discussion and Limitations</title>
        <p>Our research has focused on the geographic aspect of entity linking. However, T-Res could
directly be used for general entity linking as well (with the exception of the bydistance linking
method), given a suitable knowledge base. We have focused on digitised historical
newspapers, but T-Res could in principle be generalised to other domains, possibly depending on
additional annotated in-domain data. Further research is needed in this direction.</p>
        <p>
          Our tool can be improved in many directions: each of its modules (named entity recognition,
candidate selection, entity disambiguation) is open to improvement; for example, we could
include more sophisticated NER fine-tuning strategies [
          <xref ref-type="bibr" rid="ref8">7</xref>
          ]. Computationally, however, the NER
step is the clear bottleneck of our tool: for example, it took about 90 minutes to recognise all
toponyms in a sample of about 1,500 articles (4.2M of plain text) on a CPU, while the
candidate selection and entity disambiguation steps (using deezy and rel:nil+publ) jointly took about
three minutes.
        </p>
        <p>
          By looking at our results from a more qualitative perspective, we realise that many of our
errors stem from the KB itself. This is not surprising: not only does our KB mainly contain
modern entities, it also uses modern relations between entities (via word and entity
embeddings) as a way to represent their historical similarity. In the same vein as recent research [
          <xref ref-type="bibr" rid="ref6 ref7">5, 6</xref>
          ], we believe further research should go into understanding the impact of using
domain-appropriate entity embeddings in the disambiguation step, for example by training embeddings
which take into account time and space.
        </p>
        <p>Finally, another source of errors is the use of DeezyMatch for fuzzy string matching: while
it allows us to efficiently discover entities which would otherwise have remained hidden, such
as ‘Ashtonnnder-Lyne’ for ‘Ashton-under-Lyne’ or ‘Horbury Junotion’ for ‘Horbury Junction’,
its precision is lower than that of a traditional edit distance approach [10], sometimes resulting
in what today’s AI jargon calls hallucinations. For example, ‘Vieillevigne’ is matched to
‘Vielle Montaigne’. We therefore suggest combining the fast discoverability power of
DeezyMatch with a more conservative edit distance approach to filter obvious mismatches.25
Lastly, linking of micro locations is another direction that clearly requires further research.</p>
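        <p>Such a post-filter can be sketched as follows; footnote 25 uses the TheFuzz library, while here the standard library’s difflib similarity ratio serves as a dependency-free stand-in, with the 0.85 threshold from that footnote:

```python
import difflib

def keep_candidate(mention, candidate, threshold=0.85):
    # Keep a DeezyMatch candidate only if its surface similarity to the
    # mention clears the threshold.
    ratio = difflib.SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()
    return ratio >= threshold

# An OCR variant survives the filter...
assert keep_candidate("Ashtonnnder-Lyne", "Ashton-under-Lyne")
# ...while a hallucinated match is dropped.
assert not keep_candidate("Vieillevigne", "Vielle Montaigne")
```
        </p>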
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Historical Case Study: Geographies from Below?</title>
      <p>
        In this section, we present a case study as a form of user-testing and to assess how T-Res
supports novel historical research on the digitised press. We explore the shifting geographies in
Victorian working-class newspapers, analysing their local, national and transnational
dimensions. For this, we have used the openly available British newspapers digitised by the Living
with Machines project [
        <xref ref-type="bibr" rid="ref49">47</xref>
        ].26 Motivated by a need to counterbalance the dominance of the
liberal and conservative press with non-elite perspectives “from below” [4], the project
prioritised the digitisation of “plebeian” newspapers channelling working-class voices, and selected
exclusively ‘provincial’ papers to help research move beyond the traditional metropolitan
emphasis in periodical research [
        <xref ref-type="bibr" rid="ref24">21</xref>
        ]. The corpus is not representative of the press (or society)
as a whole, but it provides a solid starting point to explore the geographies embedded in the
popular, working-class papers.
Our case study is based on a sample of more than 2,500 complete issues published between 1880
and 1900. This resulted in a set of 2.7 million detected toponyms. In the experiments, we kept
only those georeferenced places classified as location (‘LOC’), which resulted in a collection
of 1,770,412 data points comprising 46,820 distinct georeferenced locations. Figure 1 shows
25 In our case studies, we have applied a threshold of 0.85 edit distance similarity ratio between the mention and the
returned DeezyMatch candidate, using the TheFuzz library: https://github.com/seatgeek/thefuzz.
26 https://livingwithmachines.ac.uk/over-half-of-a-million-pages-of-historical-newspapers-now-openly-available/
the global distribution of these unique toponyms. Below, we explore the places mentioned in
the news across three different levels: the local, the national and the transnational. Firstly, we
investigate to what extent the coverage of these late-Victorian newspapers was limited to the
national borders (of the United Kingdom, which at the time included today’s Republic of Ireland). Secondly,
we analyse whether these provincial titles emphasised local events or increasingly attended to
metropolitan news. Thirdly, we take a closer look at transnational aspects, more precisely the
presence of popular imperialism in these working-class newspapers.
      </p>
      <p>Turning to the first question, Figure 2 shows the proportion of British (the orange bar) versus
non-British places (the blue bar) between 1880 and 1900. While there is no dramatic change
over these two decades, it shows a decrease in foreign place names (admittedly, a small drop),
from 25% in the early 1880s to 23% around 1900. Put differently, more than 75% of all
mentions comprise British place names. Taken together, these results may be most remarkable for
their stability: while news reporting is often driven by unexpected events, on average,
attention to what is happening outside the borders of the United Kingdom remained more or less
unchanged.</p>
      <p>
        Newspapers played a critical role in shaping and upholding the nation as an imagined
community [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]. But, even though the previous analysis shows that the press was, discursively,
firmly anchored in British soil, this does not necessarily imply that it was “national” in its scope.
In his extensive study A Fleet Street in Every Town, the historian Andrew Hobbs concedes that
the Victorian reader generally preferred local news and that the press played a critical role
in forging local communities [
        <xref ref-type="bibr" rid="ref23">20</xref>
        ]. Nonetheless, while the idea of a “national press” was still
only emergent, these provincial papers were far from isolated entities. Hobbs, at the same
time, underlined the networked character of the Victorian provincial press. While newspapers
often served as chroniclers of local culture and events, they did so as local nodes in a wider,
national network of information. Put differently, the provincial press was “a ‘national’ [network]
made from many ‘local’ elements”, and London figured as a central node in this network. This
suggests that the provincial press was far from parochial.
      </p>
      <p>
        To better understand how these newspapers meandered between the local and the national level,
we scrutinise the distribution of toponyms situated in England, Wales, Scotland and Ireland.
For Figure 3, we first computed the distance between each toponym and the newspaper’s place
of publication.27 We divided these mentions into different bands based on their proximity to
the place of publication. For example, the blue band shows the proportion of place names
which were less than 25km removed from the district where the paper was produced.
Interestingly enough, the geographical coverage of these papers tended to shrink. Changes remain
small, but we do observe an increasing emphasis on more local matters (again, very
rudimentarily measured as events taking place near the place of publication). On average, the proportion
of toponyms in the blue band increases by roughly 5 percentage points over these two decades.
To assess the dominance of the metropolis in the provincial press, we calculated the distance
of each toponym to London.28 While London was very present indeed, it was not as dominant
as expected: less than 20% of all the toponyms were located in or around London. Most
surprisingly, the number of mentions seems to decline over time, which ties in with our earlier
finding that suggested a narrowing of the geographic horizon of these late-Victorian titles.
27 We used historical press directories to determine the place of publication. For more information on the press
directories, see [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ].
28 We looked at places less than 25km removed from the coordinates as reported on Wikidata (https://www.wikidata.org/wiki/Q84).
      </p>
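      <p>The distance computation behind this banding can be sketched as follows (a minimal haversine implementation; the coordinates and the band edges other than 25km are illustrative assumptions, not the paper’s exact bands):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometres.
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

def distance_band(distance_km, edges=(25, 50, 100, 200)):
    # Assign a toponym to the first band whose upper edge exceeds its distance.
    for edge in edges:
        if distance_km < edge:
            return f"<{edge}km"
    return f">={edges[-1]}km"

# A toponym a few kilometres from the place of publication lands in the blue band.
d = haversine_km(53.393, -3.014, 53.408, -2.992)  # Birkenhead to a nearby point
assert distance_band(d) == "<25km"
```
      </p>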
      <p>When comparing how individual papers vary in their emphasis on local events, the
differences become more pronounced, as shown in Figure 4. This allows us to understand and
classify how these titles differed in terms of their geographical reach and coverage. Some of
these provincial periodicals had a distinctively local emphasis. For example, close to 60% of all
places in The Birkenhead News and Wirral General Advertiser are located within a radius of
25km from Birkenhead. Others are broader in their coverage, appearing less centred on
one specific locality than on a wider region. The Atherstone, Nuneaton, and Warwickshire
Times can serve as an example. Even though 50% of all toponyms are less than 50km removed
from Atherstone, just about 20% appear within a 25km radius of this town. Exploring the
distribution of these toponyms, therefore, might be a valuable way of understanding how these
working-class papers anchored themselves spatially, negotiating between local, regional and
national identities.</p>
      <p>Lastly, we scrutinised the transnational level, focusing on the imperial geographies
embedded in these digitised newspapers. In his analysis of popular imperialism, Nicholson [37]
relies on popular provincial newspapers to probe the attitudes of the working classes towards
the imperialist project. Especially concerning the Second Boer War (1899–1902), he questions
whether working-class patriotic support for this endeavour was as strong as historians
previously imagined. Looking at the results gathered from our corpus, it firstly transpires that
geographical mentions of the empire were relatively rare, consistently hovering around 5%
of all toponyms. Zooming in on Africa and Asia, the numbers are even lower—especially
compared to references to locations in Canada and Australia—except around moments of crisis,
such as the Second Boer War. Mentions of South African place names, for example, showed
a dramatic increase at the end of the 19th century. These results are preliminary and should
be complemented with additional content- and sentiment-based analyses in order to monitor
more accurately the prevalence of jingoism in the popular press.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we presented a comprehensive step-by-step examination of toponym linking for
historical newspapers in English. We argued that good performance on standard and highly
generic benchmarks does not necessarily extrapolate to other domains. When applied to
digitised historical newspapers, the accuracy of these state-of-the-art tools often drops significantly,
hinting at the complexity of finding a general solution to EL. We have presented and
evaluated a new and very adaptable tool, T-Res, that resulted from these investigations: T-Res
builds on top of robust NLP approaches, tailoring them to the specific task of toponym linking
in historical newspapers. We concluded with a historical case study that demonstrated how
our pipeline supports ongoing research on the local, national and transnational dimensions of
the popular press.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>The authors are grateful to the reviewers for their careful and constructive reviews. Work
for this paper was produced as part of Living with Machines. This project, funded by the UK
Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration
delivered by the Arts and Humanities Research Council (AHRC grant AH/S01179X/1), with
The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia,
Exeter, and Queen Mary University of London. This work was also supported by The Alan
Turing Institute (EPSRC grant EP/N510129/1).
</p>
    </sec>
    <sec id="sec-9">
      <title>A. Appendix: T-Res</title>
        <p>The following code snippet shows how the T-Res pipeline works:
from geoparser import pipeline, recogniser, ranking, linking

myner = recogniser.Recogniser(...)  # Instantiate the Recogniser
myranker = ranking.Ranker(...)  # Instantiate the Ranker
mylinker = linking.Linker(...)  # Instantiate the Linker

geoparser = pipeline.Pipeline(myner=myner, myranker=myranker, mylinker=mylinker)
output = geoparser.run_text(
    "Inspector Liddle said: I am an inspector of police, living in the city of Durham.",
    place="Alston, Cumbria, England",
    place_wqid="Q2560190",
)
The parentheses (...) indicate an ellipsis in the code, where the user has the option to
instantiate each of the three modules (the Recogniser for named entity recognition, the Ranker for
candidate selection, and the Linker for entity disambiguation) according to their needs. For
example, they may choose to instantiate a Recogniser that uses a specific model for named
entity recognition from the HuggingFace hub, or they may choose to train their own model,
provided a base model and their own dataset (in the required format). They may instantiate a
Ranker that, given a KB, uses the exact match approach to find candidates, or choose to train
their own DeezyMatch model, given a dataset of positive and negative pairs, and use it for
candidate selection. Likewise, they may instantiate a Linker module that, given a KB, uses the
mostpopular approach, or they may train their own rel disambiguation approach.</p>
        <p>The following snippet shows the (truncated) output from the previous command, as JSON:
[{"mention": "Durham",
  "string_match_score": {"Durham": [1.0, ["Q1137286", "Q5316477", "Q752266", "..."]]},
  "prior_cand_score": {
    "Q179815": 0.881,
    "Q49229": 0.522,
    "Q5316459": 0.457,
    "Q17003433": 0.455,
    "Q23082": 0.313,
    "Q458393": 0.295,
    "Q1075483": 0.293
  },
  ...
}]
For each mention detected in the input text, our tool returns:
• mention: the mention as it appears in the text.
• ner_score: NER confidence score.
• pos: start character position of the mention in the sentence.
• sent_idx: sentence index in the text.
• end_pos: end character position of the mention in the sentence.
• tag: named entity type.
• sentence: target sentence.
• prediction: predicted Wikidata entity.
• ed_score: disambiguation confidence score.
• cross_cand_score: selected candidates and their cross-candidate confidence scores.
• string_match_score: selected candidates and their string matching confidence scores.
• prior_cand_score: selected candidates and their prior confidence scores.
• latlon: geographic coordinates of the predicted entity.</p>
        <p>• wkdt_class: most common Wikidata class of the predicted entity.</p>
        <p>The tool can also be used in a step-wise manner, or for just one module of the pipeline. We
provide full documentation in our GitHub repository: https://github.com/Living-with-machines/T-Res.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          . “
          <article-title>Contextual string embeddings for sequence labeling”</article-title>
          .
          <source>In: Proceedings of the 27th international conference on computational linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Santa</given-names>
            <surname>Fe</surname>
          </string-name>
          : Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1638</fpage>
          -
          <lpage>1649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Anderson</surname>
          </string-name>
          .
          <article-title>Imagined communities: Reflections on the origin and spread of nationalism</article-title>
          .
          <source>Verso books</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Beals</surname>
          </string-name>
          and
          <string-name>
            <surname>E. Bell. “</surname>
          </string-name>
          <article-title>The atlas of digitised newspapers and metadata: Reports from Oceanic Exchanges”</article-title>
          . In: Loughborough: Loughborough University (
          <year>2020</year>
          ). doi: 10.6084/m9.figshare.11560059.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          , D. C. Wilson, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Beavan</surname>
          </string-name>
          . “
          <article-title>Bias and representativeness in digitized newspaper collections: Introducing the environmental scan”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 38.1</source>
          (
          <issue>2023</issue>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          . doi: 10.1093/llc/fqac037.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-E.</given-names>
            <surname>González-Gallardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Giamphy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Knowledge-based contexts for historical named entity recognition &amp; linking”. In: Conference and Labs of the Evaluation Forum (CLEF)</article-title>
          .
          <source>Vol. 3180. CEUR Workshop Proceedings</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Cabrera-Diego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidère</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Robust named entity recognition and linking on historical multilingual documents”</article-title>
          .
          <source>In: Conference and Labs of the Evaluation Forum (CLEF)</source>
          . Vol.
          <volume>2696</volume>
          .
          <string-name>
            <surname>CEUR-WS Working Notes</surname>
          </string-name>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Boroş</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-A.</given-names>
            <surname>Cabrera-Diego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidere</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Alleviating digitization errors in named entity recognition for historical documents”</article-title>
          .
          <source>In: Proceedings of the 24th conference on computational natural language learning. Acl</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>431</fpage>
          -
          <lpage>441</lpage>
          . doi: 10.18653/v1/2020.conll-1.35.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bunescu</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paşca</surname>
          </string-name>
          . “
          <article-title>Using Encyclopedic Knowledge for Named entity Disambiguation”</article-title>
          .
          <source>In: 11th Conference of the European Chapter of the Association for Computational Linguistics. Trento: Association for Computational Linguistics</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Nanni</surname>
            ,
            <given-names>D. van Strien</given-names>
          </string-name>
          , and D. C. Wilson. “
          <article-title>A dataset for toponym resolution in nineteenthCentury English newspapers”</article-title>
          .
          <source>In: Journal of Open Humanities Data</source>
          <volume>8</volume>
          (
          <year>2022</year>
          ). doi: 10.5334/johd.56.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>In: Proceedings of the 28th International Conference on Advances in Geographic Information Systems</source>
          .
          <year>2020</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>388</lpage>
          . doi:
          <volume>10</volume>
          .1145/3397536.3422236.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . “BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding”</article-title>
          .
          <source>In:Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Minneapolis: Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi: 10.18653/v1/N19-1423.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Díez Platas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ros Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>González-Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ruiz Fabo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Álvarez Mellado</surname>
          </string-name>
          . “
          <article-title>Medieval Spanish (12th-15th centuries) named entity recognition and attribute annotation system based on contextual information</article-title>
          ”.
          <source>In: Journal of the Association for Information Science and Technology 72.2</source>
          (
          <year>2021</year>
          ), pp.
          <fpage>224</fpage>
          -
          <lpage>238</lpage>
          . doi: 10.1002/asi.24399.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavizza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rochat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          . “
          <article-title>Diachronic evaluation of NER systems on old newspapers</article-title>
          ”.
          <source>In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)</source>
          . Bochumer Linguistische Arbeitsberichte,
          <year>2016</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Flückiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          . “
          <article-title>Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers</article-title>
          ”.
          <source>In: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum</source>
          . Ed. by
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Névéol</surname>
          </string-name>
          . Vol.
          <volume>2696</volume>
          . Thessaloniki: CEUR-WS,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Najem-Meyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          . “
          <article-title>Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents</article-title>
          ”.
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction (CLEF)</source>
          . Springer,
          <year>2022</year>
          , pp.
          <fpage>423</fpage>
          -
          <lpage>446</lpage>
          . doi: 10.1007/978-3-031-13643-6_26.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          and
          <string-name>
            <given-names>U.</given-names>
            <surname>Scaiella</surname>
          </string-name>
          . “
          <article-title>TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities)</article-title>
          ”.
          <source>In: Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          . New York: Association for Computing Machinery,
          <year>2010</year>
          , pp.
          <fpage>1625</fpage>
          -
          <lpage>1628</lpage>
          . doi: 10.1145/1871437.1871689.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>O.-E.</given-names>
            <surname>Ganea</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          . “
          <article-title>Deep Joint Entity Disambiguation with Local Neural Attention</article-title>
          ”.
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . Copenhagen: Association for Computational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>2619</fpage>
          -
          <lpage>2629</lpage>
          . doi: 10.18653/v1/D17-1277.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>C.-E.</given-names>
            <surname>González-Gallardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Girdhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Yes but... Can ChatGPT identify entities in historical documents?</article-title>
          ”.
          <source>In: arXiv preprint arXiv:2303.17322</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jean-Caurant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidère</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Assessing and minimizing the impact of OCR quality on named entity recognition</article-title>
          ”.
          <source>In: Digital Libraries for Open Knowledge (TPDL)</source>
          . Springer,
          <year>2020</year>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>101</lpage>
          . doi: 10.1007/978-3-030-54956-5_7.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          .
          <source>A Fleet Street in every town: The provincial press in England, 1855-1900</source>
          . Open Book Publishers,
          <year>2018</year>
          . doi: 10.11647/obp.0152.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          . “
          <article-title>The deleterious dominance of The Times in nineteenth-century scholarship</article-title>
          ”.
          <source>In: Journal of Victorian Culture 18.4</source>
          (
          <year>2013</year>
          ), pp.
          <fpage>472</fpage>
          -
          <lpage>497</lpage>
          . doi: 10.1080/13555502.2013.854519.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Colavizza</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Coll Ardanuy</surname>
          </string-name>
          . “
          <article-title>Neural language models for nineteenth-century English</article-title>
          ”.
          <source>In: Journal of Open Humanities Data</source>
          (
          <year>2021</year>
          ). doi: 10.5334/johd.48.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Coll Ardanuy</surname>
          </string-name>
          . “
          <article-title>DeezyMatch: A flexible deep learning approach to fuzzy string matching</article-title>
          ”.
          <source>In: Proceedings of the 2020 conference on empirical methods in natural language processing: System demonstrations</source>
          . Online: Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          . doi: 10.18653/v1/2020.emnlp-demos.9.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>J. M.</given-names>
            <surname>van Hulst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dercksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          . “
          <article-title>REL: An Entity Linker Standing on the Shoulders of Giants</article-title>
          ”.
          <source>In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR '20</source>
          . New York: ACM,
          <year>2020</year>
          , pp.
          <fpage>2197</fpage>
          -
          <lpage>2200</lpage>
          . doi: 10.1145/3397271.3401416.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Kolitsas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.-E.</given-names>
            <surname>Ganea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          . “
          <article-title>End-to-End Neural Entity Linking</article-title>
          ”.
          <source>In: Proceedings of the 22nd Conference on Computational Natural Language Learning</source>
          . Brussels,
          <year>2018</year>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>529</lpage>
          . doi: 10.18653/v1/K18-1050.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          . “
          <article-title>Improving entity linking by modeling latent relations between mentions</article-title>
          ”.
          <source>In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . Melbourne: Association for Computational Linguistics,
          <year>2018</year>
          , pp.
          <fpage>1595</fpage>
          -
          <lpage>1604</lpage>
          . doi: 10.18653/v1/P18-1148.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Leidner</surname>
          </string-name>
          . “
          <article-title>Toponym resolution in text: annotation, evaluation and applications of spatial grounding</article-title>
          ”.
          <source>In: ACM SIGIR Forum</source>
          . Vol.
          <volume>41</volume>
          . 2. New York: Association for Computing Machinery,
          <year>2007</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>126</lpage>
          . doi: 10.1145/1328964.1328989.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Linhares Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Cabrera-Diego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Boros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          . “
          <article-title>MELHISSA: a multilingual entity linking architecture for historical press articles</article-title>
          ”.
          <source>In: International Journal on Digital Libraries 23.2</source>
          (
          <year>2022</year>
          ), pp.
          <fpage>133</fpage>
          -
          <lpage>160</lpage>
          . doi: 10.1007/s00799-021-00319-6.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Linhares Pontes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hamdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidere</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          . “
          <article-title>Impact of OCR quality on named entity linking</article-title>
          ”.
          <source>In: Digital Libraries at the Crossroads of Digital Information for the Future: 21st International Conference on Asia-Pacific Digital Libraries</source>
          . Springer,
          <year>2019</year>
          , pp.
          <fpage>102</fpage>
          -
          <lpage>115</lpage>
          . doi: 10.1007/978-3-030-34058-2_11.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Fonteyn</surname>
          </string-name>
          . “
          <article-title>Adapting vs. Pre-training Language Models for Historical Languages</article-title>
          ”.
          <source>In: Journal of Data Mining &amp; Digital Humanities NLP4DH</source>
          (
          <year>2022</year>
          ). doi: 10.46298/jdmdh.9152.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moncla</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>van de Camp</surname>
          </string-name>
          . “
          <article-title>Named entity recognition goes to old regime France: geographic text analysis for early modern French corpora</article-title>
          ”.
          <source>In: International Journal of Geographical Information Science 33.12</source>
          (
          <year>2019</year>
          ), pp.
          <fpage>2498</fpage>
          -
          <lpage>2522</lpage>
          . doi: 10.1080/13658816.2019.1620235.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          . “
          <article-title>DBpedia spotlight: shedding light on the web of documents</article-title>
          ”.
          <source>In: Proceedings of the 7th international conference on semantic systems</source>
          .
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . doi: 10.1145/2063518.2063519.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Csomai</surname>
          </string-name>
          . “
          <article-title>Wikify! Linking documents to encyclopedic knowledge</article-title>
          ”.
          <source>In: Proceedings of the sixteenth ACM conference on Conference on information and knowledge management</source>
          . New York: Association for Computing Machinery,
          <year>2007</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          . doi: 10.1145/1321440.1321475.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Milne</surname>
          </string-name>
          and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          . “
          <article-title>Learning to link with Wikipedia</article-title>
          ”.
          <source>In: Proceedings of the 17th ACM conference on Information and knowledge management</source>
          . New York: Association for Computing Machinery,
          <year>2008</year>
          , pp.
          <fpage>509</fpage>
          -
          <lpage>518</lpage>
          . doi: 10.1145/1458082.1458150.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Munnelly</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lawless</surname>
          </string-name>
          . “
          <article-title>Investigating entity linking in early English legal documents</article-title>
          ”.
          <source>In: Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries</source>
          . New York: ACM,
          <year>2018</year>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>68</lpage>
          . doi: 10.1145/3197026.3197055.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Nicholson</surname>
          </string-name>
          . “
          <article-title>Popular Imperialism and the Provincial Press: Manchester Evening and Weekly Papers, 1895-1902</article-title>
          ”.
          <source>In: Victorian Periodicals Review 13.3</source>
          (
          <year>1980</year>
          ), pp.
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Olieman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Lange</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Marx</surname>
          </string-name>
          . “
          <article-title>Good Applications for Crummy Entity Linkers? The Case of Corpus Selection in Digital Humanities</article-title>
          ”.
          <source>In: arXiv preprint arXiv:1708.01162</source>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Pedrazzini</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>McGillivray</surname>
          </string-name>
          . “
          <article-title>Machines in the media: semantic change in the lexicon of mechanization in 19th-century British newspapers</article-title>
          ”.
          <source>In: Proceedings of the 2nd International Workshop on Natural Language Processing for Digital Humanities</source>
          . Taipei: Association for Computational Linguistics,
          <year>2022</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Piccinno</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ferragina</surname>
          </string-name>
          . “
          <article-title>From TagME to WAT: a new entity annotator</article-title>
          ”.
          <source>In: Proceedings of the first international workshop on Entity recognition &amp; disambiguation</source>
          . New York: Association for Computing Machinery,
          <year>2014</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>62</lpage>
          . doi: 10.1145/2633211.2634350.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Ratinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Anderson</surname>
          </string-name>
          . “
          <article-title>Local and global algorithms for disambiguation to Wikipedia</article-title>
          ”.
          <source>In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies</source>
          . Portland,
          <year>2011</year>
          , pp.
          <fpage>1375</fpage>
          -
          <lpage>1384</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Goy</surname>
          </string-name>
          . “
          <article-title>Domain-specific Named Entity Disambiguation in Historical Memoirs</article-title>
          ”.
          <source>In: Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017)</source>
          . Vol.
          <volume>2006</volume>
          . CEUR Workshop Proceedings. Rome,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Murrieta-Flores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Calado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Martins</surname>
          </string-name>
          . “
          <article-title>Toponym matching through deep neural networks</article-title>
          ”.
          <source>In: International Journal of Geographical Information Science 32.2</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>324</fpage>
          -
          <lpage>348</lpage>
          . doi: 10.1080/13658816.2017.1390119.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>März</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Schmid</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Çano</surname>
          </string-name>
          . “
          <article-title>hmBERT: Historical Multilingual Language Models for Named Entity Recognition”</article-title>
          .
          <source>In: Conference and Labs of the Evaluation Forum (CLEF). Vol. 3180. CEUR Workshop Proceedings</source>
          ,
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Florian</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Hamza</surname>
          </string-name>
          . “
          <article-title>Neural cross-lingual entity linking”</article-title>
          .
          <source>In: Thirty-Second AAAI Conference on Artificial Intelligence</source>
          . AAAI Press,
          <year>2018</year>
          , pp.
          <fpage>5464</fpage>
          -
          <lpage>5472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [46]
          <article-title>“Assessing the impact of OCR quality on downstream NLP tasks”</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Agents and Artificial Intelligence (ICAART)</source>
          . Volume 1: ARTIDIGH.
          <year>2020</year>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>496</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolfo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Beavan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>McDonough</surname>
          </string-name>
          . “
          <article-title>Hunting for Treasure: Living with Machines and the British Library Newspaper Collection”</article-title>
          . In:
          <source>Digitised Newspapers - A New Eldorado for Historians?: Reflections on Tools, Methods and Epistemology</source>
          . Ed. by
          <string-name>
            <given-names>E.</given-names>
            <surname>Bunout</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Clavert</surname>
          </string-name>
          . De Gruyter Oldenbourg,
          <year>2023</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>46</lpage>
          . doi: 10.1515/9783110729214-002.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          . “
          <article-title>Transformers: State-of-the-Art Natural Language Processing”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . doi: 10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>I.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sakuma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shindo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takefuji</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsumoto</surname>
          </string-name>
          . “
          <article-title>Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia”</article-title>
          .
          <source>In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>23</fpage>
          -
          <lpage>30</lpage>
          . doi: 10.18653/v1/2020.emnlp-demos.4.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>I.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takeda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takefuji</surname>
          </string-name>
          . “
          <article-title>Enhancing named entity recognition in Twitter messages using entity linking”</article-title>
          .
          <source>In: Proceedings of the Workshop on Noisy User-generated Text</source>
          . Beijing: Association for Computational Linguistics,
          <year>2015</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>140</lpage>
          . doi: 10.18653/v1/W15-4320.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          { "ner_score": 0.999, "pos": 74, "sent_idx": 0, "end_pos": 80, "tag": "LOC", "sentence": "Inspector Liddle said: I am an inspector of police, living in the city of Durham.", "prediction": "Q179815", "ed_score": 0.039, "cross_cand_score": { "Q179815": 0.396, "Q23082": 0.327, "Q49229": 0.141, "Q5316459": 0.049, "Q458393": 0.045, "Q17003433": 0.042, "Q1075483": 0.0
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>