Arretium or Arezzo? A Neural Approach to the Identification of Place Names in Historical Texts

Rachele Sprugnoli
Fondazione Bruno Kessler, Via Sommarive 18, Povo (TN)
sprugnoli@fbk.eu

Abstract

English. This paper presents the application of a neural architecture to the identification of place names in English historical texts. We test the impact of different word embeddings and we compare the results with those obtained with the Stanford NER module of CoreNLP before and after retraining it on a novel corpus of manually annotated historical travel writings.

Italiano. This paper presents the application of a neural architecture to the identification of place names in historical texts in English. We evaluated the impact of various word embeddings and compared the results with those obtained using the NER module of Stanford CoreNLP before and after retraining it on a new manually annotated corpus of historical travel writing.

1 Introduction

Named Entity Recognition (NER), that is the automatic identification and classification of proper names in texts, is one of the main tasks of Natural Language Processing (NLP), with a long tradition started in 1996 by the first major event dedicated to it, i.e. the Sixth Message Understanding Conference (MUC-6) (Grishman and Sundheim, 1996). In the field of Digital Humanities (DH), NER is considered one of the important challenges to tackle for the processing of large cultural datasets (Kaplan, 2015). The language variety of historical texts is, however, greatly different from that of the contemporary texts NER systems are usually developed to annotate, so an adaptation of current systems is needed.

In this paper, we focus on the identification of place names, a specific sub-task that in DH is envisaged as the first step towards the complete geoparsing of historical texts, whose final aim is to discover and analyse spatial patterns in various fields, from environmental history to literary studies, from historical demography to archaeology (Gregory et al., 2015). More specifically, we propose a neural approach applied to a new manually annotated corpus of historical travel writings. In our experiments we test the performance of different pre-trained word embeddings, including a set of word vectors we created starting from historical texts. The resources employed in the experiments are publicly released together with the model that achieved the best results in our task (https://dh.fbk.eu/technologies/place-names-historical-travel-writings).

2 Related Work

Different domains, such as Chemistry, Biomedicine and Public Administration (Eltyeb and Salim, 2014; Habibi et al., 2017; Passaro et al., 2017), have dealt with the NER task by developing domain-specific guidelines and automatic systems based on both machine learning and deep learning algorithms (Nadeau and Sekine, 2007; Ma and Hovy, 2016). In the field of Digital Humanities, applications have been proposed for the domains of Literature, History and Cultural Heritage (Borin et al., 2007; Van Hooland et al., 2013; Sprugnoli et al., 2016). In particular, the computational treatment of historical newspapers has received much attention, being at the moment the most investigated text genre (Jones and Crane, 2006; Neudecker et al., 2014; Mac Kim and Cassidy, 2015; Neudecker, 2016; Rochat et al., 2016).

Person, Organization and Location are the three basic types adopted by general-purpose NER systems, even if different entity types can be detected as well, depending on the guidelines followed for the manual annotation of the training data (Tjong Kim Sang and De Meulder, 2003; Doddington et al., 2004).
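Guidelines in the CoNLL tradition encode entity spans token by token with BIO (begin/inside/outside) labels, the same format later used for our annotated corpus. As a minimal sketch of that encoding (the example sentence and the helper function are invented for illustration, not taken from the paper's corpus or tools):

```python
# Minimal illustration of BIO encoding for Location spans.
# The sentence and the helper are hypothetical examples.

def to_bio(tokens, spans, label="LOC"):
    """spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags

tokens = ["We", "reached", "the", "Forum", "Romanum", "at", "dawn"]
print(to_bio(tokens, [(3, 5)]))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
```

Multi-token names such as Forum Romanum thus become one B- tag followed by I- tags, which is how the 657 multi-token place names mentioned in Section 3 are represented.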
For example, political, geographical and functional locations can be merged in a unique type or identified by different types: in any case, their detection has assumed a particular importance in the context of the spatial humanities framework, which puts geographical analysis at the center of humanities research (Bodenhamer, 2012). However, in this domain, the lack of pre-processing tools, linguistic resources, knowledge bases and gazetteers is considered a major limitation to the development of NER systems with good accuracy (Ehrmann et al., 2016).

Compared to previous works, our study focuses on a text genre not much investigated in NLP but of great importance from the historical and cultural point of view: travel writings are indeed a source of information for many research areas and are also the most representative type of intercultural narrative (Burke, 1997; Beaven, 2007). In addition, we face the problem of poor resource coverage by releasing new historical word vectors and by testing an architecture that does not require any manual feature selection, and thus neither text pre-processing nor gazetteers.

3 Manual Annotation

We manually annotated a corpus of 100,000 tokens divided in 38 texts taken from a collection of English travel writings (both travel reports and guidebooks) about Italy published in the second half of the XIX century and the '30s of the XX century (Sprugnoli, 2018). The tag Location was used to mark all named entities (including nicknames like city on the seven hill) referring to:

• geographical locations: landmasses (Janiculum Hill, Vesuvius), bodies of water (Tiber, Mediterranean Sea), celestial bodies (Mars), natural areas (Campagna Romana, Sorrentine Peninsula);

• political locations: areas defined by socio-political groups, such as cities (Venice, Palermo), regions (Tuscany, Lazio), kingdoms (Regno delle due Sicilie), nations (Italy, Vatican);

• functional locations: areas and places that serve a particular purpose, such as facilities (Hotel Riposo, Church of St. Severo), monuments and archaeological sites (Forum Romanum) and streets (Via dell'Indipendenza).

The three aforementioned definitions correspond to three entity types of the ACE guidelines, i.e. GPE (geo-political entities), LOC (locations) and FAC (facilities): we extended this latter type to cover material cultural assets, that is the built cultural inheritance made of buildings, sites and monuments that constitute relevant locations in the travel domain.

The annotation required 3 person/days of work and, at the end, 2,228 proper names of locations were identified in the corpus, among which 657 were multi-token (29.5%). The inter-annotator agreement, calculated on a subset of 3,200 tokens, achieved a Cohen's kappa coefficient of 0.93 (Cohen, 1960), in line with previous results on named entity annotation in historical texts (Ehrmann et al., 2016).

The annotation highlighted the presence of specific phenomena characterising place names in historical travel writings. First of all, the same place can be recorded with variations in spelling across different texts but also within the same text: for example, modern names can appear together with the corresponding ancient names (Trapani gradually assumes the form that gave it its Greek name of Drepanum) and places can be addressed by using both the English name and the original one, the latter occurring in particular in code-mixing passages (Sprugnoli et al., 2017) such as in: (Byron himself hated the recollection of his life in Venice, and I am sure no one else need like it. But he is become a cosa di Venezia, and you cannot pass his palace without having it pointed out to you by the gondoliers.). Second, some names are written with the original Latin alphabet graphemes, such as Ætna and Tropæa Marii. Then, there are names having a wrong spelling: e.g. Cammaiore instead of Camaiore and Momio instead of Mommio. In addition, there are several long multi-token proper names, especially in the case of churches and other historical sites, e.g. House of the Tragic Poet, Church of San Pietro in Vincoli, but also abbreviated names used to anonymise personal addresses, e.g. Hotel B.

Travel writings included in the corpus are about cities and regions throughout Italy, so there is a high diversity in the mentioned locations, from valleys in the Alps (Val Buona) to small villages in Sicily (Capo S. Vito). However, even if the main topic of the corpus is the description of travels in Italy, there are also references to places outside the country, typically used to make comparisons (Piedmont, in Italy, is nothing at all like neighbouring Dauphiné or Savoie).

4 Experiments

Experiments for the automatic identification of place names were carried out using the annotated corpus described in the previous Section. The corpus, in BIO format, was divided in a training, a test and a development set following a 80/10/10 split. For the classification, we tested two approaches: we retrained the NER module of Stanford CoreNLP with our in-domain annotated corpus, and we used a BiLSTM implementation, evaluating the impact of different word embeddings, including three new historical pre-trained word vectors.

4.1 Retraining of the Stanford NER Module

The NER system integrated in Stanford CoreNLP is an implementation of Conditional Random Field (CRF) sequence models (Finkel et al., 2005) trained on a corpus made up of several datasets (CoNLL, MUC-6, MUC-7, ACE) for a total of more than one million tokens (https://nlp.stanford.edu/software/CRF-NER.html). The model distributed with CoreNLP is therefore based on contemporary texts, most of them of the news genre but also weblogs, newsgroup messages and broadcast conversations. We evaluated this model (belonging to the 3.8.0 release of CoreNLP) on our test set and then we trained a new CRF model using our training data.

4.2 Neural Approach

We adopted an implementation of BiLSTM-CRF developed by the Ubiquitous Knowledge Processing Lab of the Technische Universität Darmstadt (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). This architecture exploits casing information, character embeddings and word embeddings; no feature engineering is required (Reimers and Gurevych, 2017a). We chose this implementation because the authors propose recommended hyperparameter configurations for several sequence labelling tasks, including NER, which we took as a reference for our own experiments. More specifically, the setup suggested by Reimers and Gurevych (2017a) for the NER task is summarised below:

• dropout: 0.25, 0.25
• classifier: CRF
• LSTM size: 100
• optimizer: NADAM
• word embeddings: GloVe Common Crawl 840B
• character embeddings: CNN
• miniBatchSize: 32

Starting from this configuration, we evaluated the performance of the NER classifier trying different pre-trained word embeddings. Given that the score of a single run is not significant, due to the different results produced by different seed values (Reimers and Gurevych, 2017b), we ran the system three times and we calculated the average of the test score corresponding to the epoch with the highest result on the development set. We used Keras version 1.0 (https://keras.io/) with Theano 1.0.0 (http://deeplearning.net/software/theano/) as backend; we stopped after 10 epochs in case of no improvement on the development set.

4.2.1 Pre-trained Word Embeddings

We tested a set of word vectors available online, all with 300 dimensions, built on corpora of contemporary texts and widely adopted in several NLP tasks, namely: (i) GloVe embeddings, trained on a corpus of 840 billion tokens taken from Common Crawl data (Pennington et al., 2014); (ii) Levy and Goldberg embeddings, produced from the English Wikipedia with a dependency-based approach (Levy and Goldberg, 2014); (iii) fastText embeddings, trained on the English Wikipedia using sub-word information (Bojanowski et al., 2017). By taking into consideration these pre-trained embeddings, we cover different types of word representation: GloVe is based on linear bag-of-words contexts, Levy and Goldberg on dependency parse trees, and fastText on a bag of character n-grams.

In addition, we employed word vectors we developed by applying the GloVe, fastText and Levy and Goldberg algorithms to a subset of the Corpus of Historical American English (COHA) (Davies, 2012) made of more than 198 million words. The chosen subset contains more than 3,800 texts belonging to four genres (i.e. fiction, non-fiction, newspaper, magazine) published in the same temporal span as our corpus of travel writings. These historical embeddings, named HistoGlove, HistoFast and HistoLevy, are available online (http://bit.do/esiaS).

5 Results and Discussion

Table 1 shows the results of our experiments in terms of precision (P), recall (R) and F-measure (F1): the score obtained with the Stanford NER module before and after the retraining is compared with the one achieved with the deep learning architecture and different pre-trained word embeddings.

                          P     R     F1
Stanford NER             82.1  66.1  73.2
Retrained Stanford NER   78.9  79.2  79.1
Neural HistoLevy         85.3  83.3  84.3
Neural Levy              83.7  86.8  85.3
Neural HistoFast         83.9  87.4  85.6
Neural GloVe             83.7  87.9  86.0
Neural FastText          86.3  86.3  86.3
Neural HistoGlove        86.4  88.5  87.4

Table 1: Results of the experiments.

The neural approach performs remarkably better than the CRF sequence models, with a difference ranging from 11 to 14 points in terms of F1, depending on the word vectors used. The original Stanford module produces markedly unbalanced results, with the lowest recall and F1 but a precision above 82. In all the other experiments, scores are more balanced, even if in the majority of the neural experiments recall is slightly higher than precision, meaning that the BiLSTM is better able to generalise the observations of named entities from the training data. Although the training data are few compared to the corpora used for the original Stanford NER module, they produce an improvement of 13.1 and 5.9 points on recall and F1 respectively, demonstrating the positive impact of having in-domain annotated data.

                     Stanford NER   Neural HistoGloVe
                     F1             F1
Little Pilgrimage    80.9           90.7
Naples Riviera       73.3           86.0
Rome                 55.6           80.9

Table 2: Comparison of F1 in the three test files.
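The neural scores reported above average three runs, following the selection protocol of Section 4.2: for each run, the test score at the epoch with the highest development score is retained, and the three retained scores are averaged. A minimal sketch of that protocol (the per-epoch scores below are invented for illustration, not the paper's actual figures):

```python
# Sketch of the multi-run model-selection protocol of Section 4.2.
# All per-epoch scores are invented numbers, for illustration only.

def select_test_score(dev_scores, test_scores):
    """Return the test score at the epoch maximising the dev score."""
    best_epoch = max(range(len(dev_scores)), key=dev_scores.__getitem__)
    return test_scores[best_epoch]

runs = [
    # (dev F1 per epoch, test F1 per epoch), one tuple per seed
    ([78.0, 81.2, 80.9], [79.5, 84.0, 83.1]),
    ([77.5, 80.0, 81.0], [80.1, 82.9, 85.2]),
    ([79.1, 80.6, 80.2], [81.0, 84.4, 83.8]),
]

scores = [select_test_score(dev, test) for dev, test in runs]
average = sum(scores) / len(scores)
print(round(average, 1))
# 84.5
```

Selecting the epoch on the development set rather than on the test set keeps the reported test scores unbiased, while averaging over seeds addresses the run-to-run variance documented by Reimers and Gurevych (2017b).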
As for word vectors, dependency-based embeddings are not the best word representation for the NER task, having the lowest F1 among the experiments with the neural architecture. It is worth noticing that GloVe, suggested as the best word vectors by Reimers and Gurevych (2017a) for the NER task on contemporary texts, does not achieve the best scores on our historical corpus. Linear bag-of-words contexts are, however, confirmed as the most appropriate word representation for the identification of named entities, given that HistoGloVe produces the highest scores for all three metrics.

The improvement obtained with the neural approach combined with historical word vectors and in-domain training data is evident when looking in detail at the results over the three files constituting the test set. These texts were extracted from two travel reports, "A Little Pilgrimage in Italy" (1911) and "Naples Riviera" (1907), and one guidebook, "Rome" (1905). The text taken from the latter book is particularly challenging because of the presence of many Latin place names and of locations related to the ancient (and even mythological) history of the city of Rome, e.g. Grotto of Lupercus, Alba Longa. As displayed in Table 2, Neural HistoGloVe increases the F1 score by 9.8 points on the first file, 12.7 on the second and 25.3 on the third.

6 Conclusions and Future Works

In this paper we presented the application of a neural architecture to the automatic identification of place names in historical texts. We chose to work on an under-investigated text genre, namely travel writings, which presents a set of specific linguistic features making the NER task particularly challenging. The deep learning approach, combined with an in-domain training set and in-domain historical embeddings, outperforms the linear CRF classifier of the Stanford NER module without the need of performing feature engineering. The annotated corpus, the best model and the historical word vectors are all freely available online.

As for future work, we plan to experiment with a finer-grained classification so as to distinguish different types of locations. In addition, another aspect worth studying is the georeferencing of the identified place names so as to map the geographical dimension of travel writings in Italy. An example of visualisation is given in Figure 1, where the locations automatically identified in the test file taken from the book "Naples Riviera" are displayed: place names have been georeferenced using the Geocoding API offered by Google (https://developers.google.com/maps/documentation/geocoding/start) and displayed through the Carto web mapping tool (https://carto.com/). Another interesting development would be the detection of the itineraries of past travellers: this application could have a potential impact on the tourism sector, suggesting historical routes alternative to those more beaten and congested and helping tourists rediscover sites long forgotten.

Figure 1: Map of place names in the Neapolitan area mentioned in the "Naples Riviera" test file.

Acknowledgments

The author wants to thank Manuela Speranza for her help with the inter-annotator agreement.

References

Tita Beaven. 2007. A life in the sun: Accounts of new lives abroad as intercultural narratives. Language and Intercultural Communication, 7(3):188–202.

David J. Bodenhamer. 2012. The spatial humanities: space, time and place in the new digital age. In History in the Digital Age, pages 35–50. Routledge.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Lars Borin, Dimitrios Kokkinakis, and Leif-Jöran Olsson. 2007. Naming the past: Named entity and animacy recognition in 19th century Swedish literature. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 1–8.

Peter Burke. 1997. Varieties of cultural history. Cornell University Press.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Mark Davies. 2012. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora, 7(2):121–157.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In LREC, volume 2, pages 837–840.

Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic evaluation of NER systems on old newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), number EPFL-CONF-221391, pages 97–107. Bochumer Linguistische Arbeitsberichte.

Safaa Eltyeb and Naomie Salim. 2014. Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics, 6(1):17.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Ian Gregory, Christopher Donaldson, Patricia Murrieta-Flores, and Paul Rayson. 2015. Geoparsing, GIS, and textual analysis: Current developments in spatial humanities research. International Journal of Humanities and Arts Computing, 9(1):1–14.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, volume 1.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Alison Jones and Gregory Crane. 2006. The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection. In Digital Libraries, 2006. JCDL'06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on, pages 31–40. IEEE.

Frédéric Kaplan. 2015. A map for big data research in digital humanities. Frontiers in Digital Humanities, 2:1.

Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In ACL (2), pages 302–308.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1064–1074.

Sunghwan Mac Kim and Steve Cassidy. 2015. Finding names in Trove: named entity recognition for Australian historical newspapers. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 57–65.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Clemens Neudecker, Lotte Wilms, Wille Jaan Faber, and Theo van Veen. 2014. Large-scale refinement of digital historic newspapers with named entity recognition. In Proc. IFLA Newspapers/GENLOC Pre-Conference Satellite Meeting.

Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers. In LREC.

Lucia C. Passaro, Alessandro Lenci, and Anna Gabbolini. 2017. INFORMed PA: A NER for the Italian Public Administration Domain. In Fourth Italian Conference on Computational Linguistics CLiC-it 2017, pages 246–251. Accademia University Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nils Reimers and Iryna Gurevych. 2017a. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv preprint arXiv:1707.06799.

Nils Reimers and Iryna Gurevych. 2017b. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348.

Yannick Rochat, Maud Ehrmann, Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan. 2016. Navigating through 200 years of historical newspapers. In iPRES 2016, number EPFL-CONF-218707.

Rachele Sprugnoli, Giovanni Moretti, Sara Tonelli, and Stefano Menini. 2016. Fifty years of European history through the lens of computational linguistics: the De Gasperi Project. Italian Journal of Computational Linguistics, pages 89–100.

Rachele Sprugnoli, Sara Tonelli, Giovanni Moretti, and Stefano Menini. 2017. A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017).

Rachele Sprugnoli. 2018. "Two days we have passed with the ancients...": a Digital Resource of Historical Travel Writings on Italy. In Book of Abstracts of the AIUCD 2018 Conference. AIUCD.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.

Seth Van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle. 2013. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities, 30(2):262–279.