Arretium or Arezzo? A Neural Approach to the Identification of Place Names in Historical Texts

Rachele Sprugnoli
Fondazione Bruno Kessler, Via Sommarive 18, Povo (TN)
sprugnoli@fbk.eu

Abstract

English. This paper presents the application of a neural architecture to the identification of place names in English historical texts. We test the impact of different word embeddings and we compare the results with those obtained with the Stanford NER module of CoreNLP before and after retraining it on a novel corpus of manually annotated historical travel writings.

Italiano. This paper presents the application of a neural architecture to the identification of place names in historical texts in English. We evaluated the impact of various word embeddings and compared the results with those obtained using the NER module of Stanford CoreNLP before and after retraining it on a new manually annotated corpus of historical travel writing.

1 Introduction

Named Entity Recognition (NER), that is the automatic identification and classification of proper names in texts, is one of the main tasks of Natural Language Processing (NLP), with a long tradition started in 1996 by the first major event dedicated to it, i.e. the Sixth Message Understanding Conference (MUC-6) (Grishman and Sundheim, 1996). In the field of Digital Humanities (DH), NER is considered one of the important challenges to tackle for the processing of large cultural datasets (Kaplan, 2015). The language variety of historical texts is, however, greatly different from that of the contemporary texts NER systems are usually developed to annotate, so an adaptation of current systems is needed.

In this paper, we focus on the identification of place names, a specific sub-task that in DH is envisaged as the first step towards the complete geoparsing of historical texts, whose final aim is to discover and analyse spatial patterns in various fields, from environmental history to literary studies, from historical demography to archaeology (Gregory et al., 2015). More specifically, we propose a neural approach applied to a new manually annotated corpus of historical travel writings. In our experiments we test the performance of different pre-trained word embeddings, including a set of word vectors we created starting from historical texts. The resources employed in the experiments are publicly released together with the model that achieved the best results in our task (https://dh.fbk.eu/technologies/place-names-historical-travel-writings).

2 Related Work

Different domains, such as Chemistry, Biomedicine and Public Administration (Eltyeb and Salim, 2014; Habibi et al., 2017; Passaro et al., 2017), have dealt with the NER task by developing domain-specific guidelines and automatic systems based on both machine learning and deep learning algorithms (Nadeau and Sekine, 2007; Ma and Hovy, 2016). In the field of Digital Humanities, applications have been proposed for the domains of Literature, History and Cultural Heritage (Borin et al., 2007; Van Hooland et al., 2013; Sprugnoli et al., 2016). In particular, the computational treatment of historical newspapers has received much attention, being at the moment the most investigated text genre (Jones and Crane, 2006; Neudecker et al., 2014; Mac Kim and Cassidy, 2015; Neudecker, 2016; Rochat et al., 2016).

Person, Organization and Location are the three basic types adopted by general-purpose NER systems, even if different entity types can be detected as well, depending on the guidelines followed for the manual annotation of the training data (Tjong Kim Sang and De Meulder, 2003; Doddington et al., 2004).
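Guidelines in the CoNLL tradition encode entity spans token by token with BIO (begin/inside/outside) labels, the same format later used for our annotated corpus. As a minimal sketch of that encoding (the example sentence and the helper function are invented for illustration, not taken from the paper's corpus or tools):

```python
# Minimal illustration of BIO encoding for Location spans.
# The sentence and the helper are hypothetical examples.

def to_bio(tokens, spans, label="LOC"):
    """spans: list of (start, end) token indices, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags

tokens = ["We", "reached", "the", "Forum", "Romanum", "at", "dawn"]
print(to_bio(tokens, [(3, 5)]))
# ['O', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
```

Multi-token names such as Forum Romanum thus become one B- tag followed by I- tags, which is how the 657 multi-token place names mentioned in Section 3 are represented.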
For example, political, geographical and functional locations can be merged in a unique type or identified by different types: in any case, their detection has assumed a particular importance in the context of the spatial humanities framework, which puts geographical analysis at the center of humanities research (Bodenhamer, 2012). However, in this domain, the lack of pre-processing tools, linguistic resources, knowledge bases and gazetteers is considered a major limitation to the development of NER systems with good accuracy (Ehrmann et al., 2016).

Compared to previous works, our study focuses on a text genre not much investigated in NLP but of great importance from the historical and cultural point of view: travel writings are indeed a source of information for many research areas and are also the most representative type of intercultural narrative (Burke, 1997; Beaven, 2007). In addition, we face the problem of poor resource coverage by releasing new historical word vectors and by testing an architecture that does not require any manual feature selection, and thus neither text pre-processing nor gazetteers.

3 Manual Annotation

We manually annotated a corpus of 100,000 tokens divided in 38 texts taken from a collection of English travel writings (both travel reports and guidebooks) about Italy published in the second half of the XIX century and the '30s of the XX century (Sprugnoli, 2018). The tag Location was used to mark all named entities (including nicknames like city on the seven hill) referring to:

• geographical locations: landmasses (Janiculum Hill, Vesuvius), bodies of water (Tiber, Mediterranean Sea), celestial bodies (Mars), natural areas (Campagna Romana, Sorrentine Peninsula);

• political locations: areas defined by socio-political groups, such as cities (Venice, Palermo), regions (Tuscany, Lazio), kingdoms (Regno delle due Sicilie), nations (Italy, Vatican);

• functional locations: areas and places that serve a particular purpose, such as facilities (Hotel Riposo, Church of St. Severo), monuments and archaeological sites (Forum Romanum) and streets (Via dell'Indipendenza).

The three aforementioned definitions correspond to three entity types of the ACE guidelines, i.e. GPE (geo-political entities), LOC (locations) and FAC (facilities): we extended this latter type to cover material cultural assets, that is the built cultural inheritance made of buildings, sites and monuments that constitute relevant locations in the travel domain.

The annotation required 3 person/days of work and, at the end, 2,228 proper names of locations were identified in the corpus, among which 657 were multi-token (29.5%). The inter-annotator agreement, calculated on a subset of 3,200 tokens, achieved a Cohen's kappa coefficient of 0.93 (Cohen, 1960), in line with previous results on named entity annotation in historical texts (Ehrmann et al., 2016).

The annotation highlighted the presence of specific phenomena characterising place names in historical travel writings. First of all, the same place can be recorded with variations in spelling across different texts but also within the same text: for example, modern names can appear together with the corresponding ancient names (Trapani gradually assumes the form that gave it its Greek name of Drepanum) and places can be addressed by using both the English name and the original one, the latter occurring in particular in code-mixing passages (Sprugnoli et al., 2017) such as in: (Byron himself hated the recollection of his life in Venice, and I am sure no one else need like it. But he is become a cosa di Venezia, and you cannot pass his palace without having it pointed out to you by the gondoliers.). Second, some names are written with the original Latin alphabet graphemes, such as Ætna and Tropæa Marii. Then, there are names having a wrong spelling: e.g. Cammaiore instead of Camaiore and Momio instead of Mommio. In addition, there are several long multi-token proper names, especially in the case of churches and other historical sites, e.g. House of the Tragic Poet, Church of San Pietro in Vincoli, but also abbreviated names used to anonymise personal addresses, e.g. Hotel B.

Travel writings included in the corpus are about cities and regions throughout Italy, so there is a high diversity in the mentioned locations, from valleys in the Alps (Val Buona) to small villages in Sicily (Capo S. Vito). However, even if the main topic of the corpus is the description of travels in Italy, there are also references to places outside the country, typically used to make comparisons (Piedmont, in Italy, is nothing at all like neighbouring Dauphiné or Savoie).

4 Experiments

Experiments for the automatic identification of place names were carried out using the annotated corpus described in the previous Section. The corpus, in BIO format, was divided in a training, a test and a development set following a 80/10/10 split. For the classification, we tested two approaches: we retrained the NER module of Stanford CoreNLP with our in-domain annotated corpus, and we used a BiLSTM implementation, evaluating the impact of different word embeddings, including three new historical pre-trained word vectors.

4.1 Retraining of the Stanford NER Module

The NER system integrated in Stanford CoreNLP is an implementation of Conditional Random Field (CRF) sequence models (Finkel et al., 2005) trained on a corpus made up of several datasets (CoNLL, MUC-6, MUC-7, ACE) for a total of more than one million tokens (https://nlp.stanford.edu/software/CRF-NER.html). The model distributed with CoreNLP is therefore based on contemporary texts, most of them of the news genre but also weblogs, newsgroup messages and broadcast conversations. We evaluated this model (belonging to the 3.8.0 release of CoreNLP) on our test set and then we trained a new CRF model using our training data.

4.2 Neural Approach

We adopted an implementation of BiLSTM-CRF developed by the Ubiquitous Knowledge Processing Lab of the Technische Universität Darmstadt (https://github.com/UKPLab/emnlp2017-bilstm-cnn-crf). This architecture exploits casing information, character embeddings and word embeddings; no feature engineering is required (Reimers and Gurevych, 2017a). We chose this implementation because the authors propose recommended hyperparameter configurations for several sequence labelling tasks, including NER, which we took as a reference for our own experiments. More specifically, the setup suggested by Reimers and Gurevych (2017a) for the NER task is summarised below:

• dropout: 0.25, 0.25
• classifier: CRF
• LSTM size: 100
• optimizer: NADAM
• word embeddings: GloVe Common Crawl 840B
• character embeddings: CNN
• miniBatchSize: 32

Starting from this configuration, we evaluated the performance of the NER classifier trying different pre-trained word embeddings. Given that the score of a single run is not significant, due to the different results produced by different seed values (Reimers and Gurevych, 2017b), we ran the system three times and we calculated the average of the test score corresponding to the epoch with the highest result on the development set. We used Keras version 1.0 (https://keras.io/) with Theano 1.0.0 (http://deeplearning.net/software/theano/) as backend; we stopped after 10 epochs in case of no improvement on the development set.

4.2.1 Pre-trained Word Embeddings

We tested a set of word vectors available online, all with 300 dimensions, built on corpora of contemporary texts and widely adopted in several NLP tasks, namely: (i) GloVe embeddings, trained on a corpus of 840 billion tokens taken from Common Crawl data (Pennington et al., 2014); (ii) Levy and Goldberg embeddings, produced from the English Wikipedia with a dependency-based approach (Levy and Goldberg, 2014); (iii) fastText embeddings, trained on the English Wikipedia using sub-word information (Bojanowski et al., 2017). By taking into consideration these pre-trained embeddings, we cover different types of word representation: GloVe is based on linear bag-of-words contexts, Levy and Goldberg on dependency parse trees, and fastText on a bag of character n-grams.

In addition, we employed word vectors we developed by applying the GloVe, fastText and Levy and Goldberg algorithms to a subset of the Corpus of Historical American English (COHA) (Davies, 2012) made of more than 198 million words. The chosen subset contains more than 3,800 texts belonging to four genres (i.e. fiction, non-fiction, newspaper, magazine) published in the same temporal span as our corpus of travel writings. These historical embeddings, named HistoGlove, HistoFast and HistoLevy, are available online (http://bit.do/esiaS).

5 Results and Discussion

Table 1 shows the results of our experiments in terms of precision (P), recall (R) and F-measure (F1): the score obtained with the Stanford NER module before and after the retraining is compared with the one achieved with the deep learning architecture and different pre-trained word embeddings.

                          P     R     F1
Stanford NER             82.1  66.1  73.2
Retrained Stanford NER   78.9  79.2  79.1
Neural HistoLevy         85.3  83.3  84.3
Neural Levy              83.7  86.8  85.3
Neural HistoFast         83.9  87.4  85.6
Neural GloVe             83.7  87.9  86.0
Neural FastText          86.3  86.3  86.3
Neural HistoGlove        86.4  88.5  87.4

Table 1: Results of the experiments.

The neural approach performs remarkably better than the CRF sequence models, with a difference ranging from 11 to 14 points in terms of F1, depending on the word vectors used. The original Stanford module produces markedly unbalanced results, with the lowest recall and F1 but a precision above 82. In all the other experiments, scores are more balanced, even if in the majority of the neural experiments recall is slightly higher than precision, meaning that the BiLSTM is better able to generalise the observations of named entities from the training data. Although the training data are few compared to the corpora used for the original Stanford NER module, they produce an improvement of 13.1 and 5.9 points on recall and F1 respectively, demonstrating the positive impact of having in-domain annotated data.

                     Stanford NER   Neural HistoGloVe
                     F1             F1
Little Pilgrimage    80.9           90.7
Naples Riviera       73.3           86.0
Rome                 55.6           80.9

Table 2: Comparison of F1 in the three test files.
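The neural scores reported above average three runs, following the selection protocol of Section 4.2: for each run, the test score at the epoch with the highest development score is retained, and the three retained scores are averaged. A minimal sketch of that protocol (the per-epoch scores below are invented for illustration, not the paper's actual figures):

```python
# Sketch of the multi-run model-selection protocol of Section 4.2.
# All per-epoch scores are invented numbers, for illustration only.

def select_test_score(dev_scores, test_scores):
    """Return the test score at the epoch maximising the dev score."""
    best_epoch = max(range(len(dev_scores)), key=dev_scores.__getitem__)
    return test_scores[best_epoch]

runs = [
    # (dev F1 per epoch, test F1 per epoch), one tuple per seed
    ([78.0, 81.2, 80.9], [79.5, 84.0, 83.1]),
    ([77.5, 80.0, 81.0], [80.1, 82.9, 85.2]),
    ([79.1, 80.6, 80.2], [81.0, 84.4, 83.8]),
]

scores = [select_test_score(dev, test) for dev, test in runs]
average = sum(scores) / len(scores)
print(round(average, 1))
# 84.5
```

Selecting the epoch on the development set rather than on the test set keeps the reported test scores unbiased, while averaging over seeds addresses the run-to-run variance documented by Reimers and Gurevych (2017b).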
As for word vectors, dependency-based embeddings are not the best word representation for the NER task, having the lowest F1 among the experiments with the neural architecture. It is worth noticing that GloVe, suggested as the best word vectors by Reimers and Gurevych (2017a) for the NER task on contemporary texts, does not achieve the best scores on our historical corpus. Linear bag-of-words contexts are, however, confirmed as the most appropriate word representation for the identification of named entities, given that HistoGloVe produces the highest scores for all three metrics.

The improvement obtained with the neural approach combined with historical word vectors and in-domain training data is evident when looking in detail at the results over the three files constituting the test set. These texts were extracted from two travel reports, "A Little Pilgrimage in Italy" (1911) and "Naples Riviera" (1907), and one guidebook, "Rome" (1905). The text taken from the latter book is particularly challenging because of the presence of many Latin place names and of locations related to the ancient (and even mythological) history of the city of Rome, e.g. Grotto of Lupercus, Alba Longa. As displayed in Table 2, Neural HistoGloVe increases the F1 score by 9.8 points on the first file, 12.7 on the second and 25.3 on the third.

6 Conclusions and Future Works

In this paper we presented the application of a neural architecture to the automatic identification of place names in historical texts. We chose to work on an under-investigated text genre, namely travel writings, which presents a set of specific linguistic features making the NER task particularly challenging. The deep learning approach, combined with an in-domain training set and in-domain historical embeddings, outperforms the linear CRF classifier of the Stanford NER module without the need of performing feature engineering. The annotated corpus, the best model and the historical word vectors are all freely available online.

As for future work, we plan to experiment with a finer-grained classification so as to distinguish different types of locations. In addition, another aspect worth studying is the georeferencing of the identified place names so as to map the geographical dimension of travel writings in Italy. An example of visualisation is given in Figure 1, where the locations automatically identified in the test file taken from the book "Naples Riviera" are displayed: place names have been georeferenced using the Geocoding API offered by Google (https://developers.google.com/maps/documentation/geocoding/start) and displayed through the Carto web mapping tool (https://carto.com/). Another interesting development would be the detection of the itineraries of past travellers: this application could have a potential impact on the tourism sector, suggesting historical routes alternative to those more beaten and congested and helping tourists rediscover sites long forgotten.

Figure 1: Map of place names in the Neapolitan area mentioned in the "Naples Riviera" test file.

Acknowledgments

The author wants to thank Manuela Speranza for her help with the inter-annotator agreement.

References

Tita Beaven. 2007. A life in the sun: Accounts of new lives abroad as intercultural narratives. Language and Intercultural Communication, 7(3):188–202.

David J. Bodenhamer. 2012. The spatial humanities: space, time and place in the new digital age. In History in the Digital Age, pages 35–50. Routledge.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, 5:135–146.

Lars Borin, Dimitrios Kokkinakis, and Leif-Jöran Olsson. 2007. Naming the past: Named entity and animacy recognition in 19th century Swedish literature. In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007), pages 1–8.

Peter Burke. 1997. Varieties of cultural history. Cornell University Press.

Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46.

Mark Davies. 2012. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora, 7(2):121–157.

George R. Doddington, Alexis Mitchell, Mark A. Przybocki, Lance A. Ramshaw, Stephanie Strassel, and Ralph M. Weischedel. 2004. The Automatic Content Extraction (ACE) Program: Tasks, Data, and Evaluation. In LREC, volume 2, pages 837–840.

Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic evaluation of NER systems on old newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), number EPFL-CONF-221391, pages 97–107. Bochumer Linguistische Arbeitsberichte.

Safaa Eltyeb and Naomie Salim. 2014. Chemical named entities recognition: a review on approaches and applications. Journal of Cheminformatics, 6(1):17.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics.

Ian Gregory, Christopher Donaldson, Patricia Murrieta-Flores, and Paul Rayson. 2015. Geoparsing, GIS, and textual analysis: Current developments in spatial humanities research. International Journal of Humanities and Arts Computing, 9(1):1–14.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In COLING 1996 Volume 1: The 16th International Conference on Computational Linguistics, volume 1.

Maryam Habibi, Leon Weber, Mariana Neves, David Luis Wiegandt, and Ulf Leser. 2017. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics, 33(14):i37–i48.

Alison Jones and Gregory Crane. 2006. The challenge of Virginia Banks: an evaluation of named entity analysis in a 19th-century newspaper collection. In Digital Libraries, 2006. JCDL'06. Proceedings of the 6th ACM/IEEE-CS Joint Conference on, pages 31–40. IEEE.

Frédéric Kaplan. 2015. A map for big data research in digital humanities. Frontiers in Digital Humanities, 2:1.

Omer Levy and Yoav Goldberg. 2014. Dependency-Based Word Embeddings. In ACL (2), pages 302–308.

Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1064–1074.

Sunghwan Mac Kim and Steve Cassidy. 2015. Finding names in Trove: named entity recognition for Australian historical newspapers. In Proceedings of the Australasian Language Technology Association Workshop 2015, pages 57–65.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Clemens Neudecker, Lotte Wilms, Wille Jaan Faber, and Theo van Veen. 2014. Large-scale refinement of digital historic newspapers with named entity recognition. In Proc. IFLA Newspapers/GENLOC Pre-Conference Satellite Meeting.

Clemens Neudecker. 2016. An Open Corpus for Named Entity Recognition in Historic Newspapers. In LREC.

Lucia C. Passaro, Alessandro Lenci, and Anna Gabbolini. 2017. INFORMed PA: A NER for the Italian Public Administration Domain. In Fourth Italian Conference on Computational Linguistics CLiC-it 2017, pages 246–251. Accademia University Press.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nils Reimers and Iryna Gurevych. 2017a. Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. arXiv preprint arXiv:1707.06799.

Nils Reimers and Iryna Gurevych. 2017b. Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348.

Yannick Rochat, Maud Ehrmann, Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan. 2016. Navigating through 200 years of historical newspapers. In iPRES 2016, number EPFL-CONF-218707.

Rachele Sprugnoli, Giovanni Moretti, Sara Tonelli, and Stefano Menini. 2016. Fifty years of European history through the lens of computational linguistics: the De Gasperi Project. Italian Journal of Computational Linguistics, pages 89–100.

Rachele Sprugnoli, Sara Tonelli, Giovanni Moretti, and Stefano Menini. 2017. A little bit of bella pianura: Detecting Code-Mixing in Historical English Travel Writing. In Proceedings of the Fourth Italian Conference on Computational Linguistics (CLiC-it 2017).

Rachele Sprugnoli. 2018. "Two days we have passed with the ancients...": a Digital Resource of Historical Travel Writings on Italy. In Book of Abstracts of the AIUCD 2018 Conference. AIUCD.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 142–147. Association for Computational Linguistics.

Seth Van Hooland, Max De Wilde, Ruben Verborgh, Thomas Steiner, and Rik Van de Walle. 2013. Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities, 30(2):262–279.