=Paper= {{Paper |id=Vol-2006/paper020 |storemode=property |title=Domain-specific Named Entity Disambiguation in Historical Memoirs |pdfUrl=https://ceur-ws.org/Vol-2006/paper020.pdf |volume=Vol-2006 |authors=Marco Rovera,Federico Nanni,Simone Paolo Ponzetto,Anna Goy |dblpUrl=https://dblp.org/rec/conf/clic-it/RoveraNPG17 }} ==Domain-specific Named Entity Disambiguation in Historical Memoirs== https://ceur-ws.org/Vol-2006/paper020.pdf
    Domain-specific Named Entity Disambiguation in Historical Memoirs

Marco Rovera (1), Federico Nanni (2), Simone Paolo Ponzetto (2), Anna Goy (1)

(1) Dipartimento di Informatica, Università di Torino, Italy
    {rovera,goy}@di.unito.it
(2) Data and Web Science Group, University of Mannheim, Germany
    {federico,simone}@informatik.uni-mannheim.de


Abstract

English. This paper presents the results of the extraction of named entities from a collection of historical memoirs about the Italian Resistance during World War II. The methodology followed for the extraction and disambiguation task is discussed, as well as its evaluation. For the semantic annotation of the dataset, we have developed a pipeline based on established practices for extracting and disambiguating Named Entities. This was necessary considering the poor performance of the out-of-the-box Named Entity Recognition and Disambiguation (NERD) tools tested in the initial phase of this work.

Italiano. Questo articolo presenta l’attività di estrazione di entità nominate realizzata su una collezione di memorie relative al periodo della Resistenza italiana nella Seconda Guerra Mondiale. Verrà discussa la metodologia sviluppata per il processo di estrazione e disambiguazione delle entità nominate, nonché la sua valutazione. L’implementazione di una metodologia di estrazione e disambiguazione basata su lookup si è resa necessaria in considerazione delle scarse prestazioni dei sistemi di Named Entity Recognition and Disambiguation (NERD), come si evince dalla discussione nella prima parte di questo lavoro.

1   Introduction and Motivation

Current NLP techniques allow us to treat some types of historical textual resources provided by, among others, historical archives and libraries, as a source of information (and, in prospect, of knowledge) for automatic systems. Besides encyclopedic resources, libraries and archives provide many different types of texts, often spanning very specific geographical, individual or thematic contexts, for which current knowledge extraction systems may lack suitable information. Nevertheless, extracting, disambiguating and linking the information provided by historical textual documents with respect to external knowledge bases is still a crucial step towards automatic access to written resources and towards the further use of such knowledge in end-user applications (e.g. navigation, rich semantic search, creation of narrative chains). In order to address longer-term tasks, such as event extraction from historical texts (Goy et al., 2015), we first addressed the task of extracting and disambiguating Named Entities (Persons, Locations and Organizations) from a corpus of historical memoirs of the “Liberation War” in Italy, during the Second World War. Due to the specificity of the domain and of the entities involved, state-of-the-art tools for Named Entity Recognition and Disambiguation perform poorly, which led us to pursue our goal with a different approach. In this paper we present a collection of documents created by digitizing historical memoirs, together with an overview of the methodology we followed for the extraction and disambiguation of Persons, Locations and Organizations, as well as the results of the evaluation of its output in comparison with the output of two state-of-the-art systems. The paper is organized as follows: Section 2 discusses some related projects, while Section 3 presents the dataset used in the experiment. Section 4 describes the test of two automatic NER tools (4.1) and the methodology devised for our experiment (4.2). Section 5 discusses the results of the evaluation, while Section 6 concludes the paper and outlines the next developments of the project.

2   Related Work

The work described in this paper is mainly related to Named Entity Recognition and Disambiguation (NERD) techniques and their application in the field of Digital Humanities (DH), in particular on historical texts. While NER refers to the task of identifying named entities in text and classifying them according to a set of categories, a Named Entity Disambiguation (NED) task is aimed at assigning a correspondence between an ambiguous surface form and the individual entity it refers to. Although analytically they can be considered as two separate tasks, the current availability of large, publicly accessible knowledge bases has made it possible to merge them into the task of Entity Linking (EL), which aims at linking a surface form from a text to the corresponding entry in a resource like DBpedia or Wikipedia (Barrière, 2016). A recent application of EL techniques in a DH context is presented in Brando et al. (2016), where the authors use a graph-based approach and exploit Linked Data for linking mentions of writers in a corpus of French literary criticism and scientific essays. Discussions and experiments on the use of third-party NER services on historical OCRed texts (typewritten memoirs of Holocaust survivors and old newspapers, respectively) are provided by Rodriquez et al. (2012) and by Ehrmann et al. (2016), offering a starting point for our work, since they quantify the performance of such NER tools on specific historical texts and show their limitations (as also remarked in Nanni et al. (2017)). In the Italian DH research community, too, the interest in mining historical texts has become more evident in recent years, leading to several interesting works. In Boschetti et al. (2014), for example, the authors describe the ongoing work of applying a full Information Extraction pipeline (from OCR digitization to data visualization) to war bulletins from WWI and WWII and discuss the issues they addressed in adapting existing tools to dated and domain-specific language. Another related project with a similar setting is ALCIDE, described in Moretti et al. (2016), a platform that supports the use of text mining techniques for the navigation and visualization of information in historical and literary texts.

3   Dataset

The collection of documents used in this work is composed of 15 printed books, written in Italian, that have been digitized using standard OCR techniques, overall counting over 855,000 words (about 45,000 sentences). The documents are historical memoirs of Italian partisans from WWII. More specifically, the covered time span goes from 8 September 1943 to 25 April 1945, a period known in Italian historiography as “Resistenza” (Resistance). The geographic area encompassed by the narrated events is the south-western part of the Alps in Piemonte, Italy, with some minor exceptions. The texts have been intentionally selected for digitization because they have a partial but significant overlap in terms of narrated events, as well as of places and people involved. None of the 15 documents carries any semantic annotation. Besides the digitization of the documents, three gazetteers have been created: the first one, containing names of persons (1820 entries), has been populated using name indexes provided by 6 of the texts, while the gazetteers containing toponyms and names of organizations (1140 and 190 entries, respectively) have been built manually during the digitization activities. The setting of our work is partly determined by some features of the textual resources under analysis, in particular: 1) due to the specificity of the domain, only 4% of the persons in the gazetteer are available in the Italian Wikipedia (according to a manual check carried out on the whole gazetteer); the same problem holds for organizations and, to a smaller extent, for toponyms; 2) while for entities of type Location (LOC) and Organization (ORG) the mining process involves the usual problems (abbreviations, upper- vs lowercase mentions, ambiguity due to the same surface form), with Person (PER) entities the domain at hand presents a further issue, as it was quite common among the partisans to use aliases, or noms de guerre. This feature is shown by 32% of the occurrences in our PER gazetteer (often the most prominent ones in the narrated events). This means that in text persons are to be found under different combinations of name, surname and nickname. While in some cases this additional information makes the disambiguation process easier, in many other cases it may represent an additional source of ambiguity. The PER gazetteer is structured in three fields, namely Name, Surname and Alias, that are later combined into patterns (see Section 4.2); conversely, in the ORG and LOC gazetteers, for each entry all the possible lexical forms are listed (for
the Italian Action Party, for example, we will have: Partito d’Azione, PdA, Pd’A, P.d.A. and so on).

4   Experiment

4.1   Test of existing automatic NERD tools

In order to clarify the need for an ad hoc extraction and disambiguation approach for our texts, we first tried state-of-the-art NERD tools; we randomly selected 200 sentences from the corpus and annotated them with NERD (Rizzo and Troncy, 2012), a framework that aggregates the results from different NER systems (Alchemy API, DBpedia Spotlight, TextRazor, Zemanta among others), and TagMe (Ferragina and Scaiella, 2010), an entity linker to Wikipedia available also for Italian. Table 1 shows the proportion of correctly recognized (i.e. classified) and linked occurrences obtained by the two systems. Since TagMe does not separate the two tasks of Recognition and Linking, for this system we only report the Linking results. In the recognition task, NERD performs quite well for Persons and Locations, while its performance drops for Organizations. As we turn to the linking task, we observe that the trend in the results is similar in the two systems: performance is very low in the case of Persons, while it improves in the case of Locations and remains quite low for Organizations. This result can partly be explained by the degree of (spatial and social) specificity of the entities that are to be found in the corpus: state-of-the-art tools perform well on prominent entities (for example “Benito Mussolini”), but large-scale knowledge bases lack suitable knowledge for specific contexts, like those that are more often to be found in the historical memoirs under analysis (and thus NERD systems are not able to link specific entities, such as Chiaffredo Barreri «Tormenta»).

           Recognition
                     PER    LOC    ORG
           NERD      0.66   0.70   0.51

           Linking
                     PER    LOC    ORG
           TagMe     0.05   0.45   0.37
           NERD      0.03   0.47   0.27

Table 1: Evaluation using TagMe and NERD (proportion of correctly recognized or linked occurrences over a sample of 200 sentences).

4.2   Methodology

The mining process initially took the form of a simple string matching in text, based on the entries provided by the gazetteers. However, due to the different ways each entity type can appear in text (as discussed in Section 3), two different strategies have been implemented: string matching with some refinements for the LOC and ORG entity types, and a slightly more elaborate strategy for PER entities, based on co-occurrence statistics derived directly from the corpus under study.

PER entities. Based on the manual analysis of the documents, 15 lexical patterns have been observed through which proper names of partisans appear in text; frequently occurring patterns are, for example, “Name Surname (Alias)”, as in “Gustavo Comollo (Pietro)”, “Name «Alias» Surname”, as in “Gustavo «Pietro» Comollo”, or “Alias Surname”, as in “Pietro Comollo”. Each of these 15 patterns has been automatically instantiated for each entry of the gazetteer. This resulted in a dictionary of instantiated patterns that has been used directly for the string matching step in text. Since a certain degree of ambiguity (homonymy) is present in the gazetteer, where many entries share the same name, surname or alias, for each instance of the patterns in the dictionary an ambiguity value has been computed, keeping track, for the ambiguous instances, of all the possible individuals they may actually refer to. For example, the pattern instance “«Renzo»”, which in Italian can be both a name and an alias, has been connected to all the entries in the gazetteer where “Renzo” appears either as name or as alias, which become candidates for that specific occurrence. Then the string matching in text has been performed. Within the found occurrences, we separated the unambiguous occurrences (those that refer to only one entry in the gazetteer), which have been considered as true positives and did not require further processing, from the ambiguous ones, for which a disambiguation step is needed. Considering only the unambiguous mentions retrieved this way, the system scored a precision of .98 (see Section 5), so we used this set of occurrences as grounding space for the disambiguation step. At this point the system has disambiguated 55.8% (9268) of the PER occurrences in the corpus, while 44.2% (7341) of the occurrences remain ambiguous (for precision and recall scores, see Table 2, “Lookup Search”).

In order to disambiguate the remaining occurrences, different heuristics have been explored. Based on the literature, we tried to apply to the Named Entity Disambiguation task the “one sense per discourse” hypothesis, as done by Barrena et al. (2014). Two other heuristics have been explored, which we can informally designate as Last Mentioned and Most Mentioned. Given an ambiguous occurrence recognized in text, the former links the occurrence to the last already disambiguated corresponding candidate. Following on from the example above, if we find the pattern “«Renzo»” in text, which is ambiguous and corresponds to several candidates from the gazetteer, the system links the mention to the same candidate as the immediately preceding occurrence of this mention. The Most Mentioned rule, conversely, assigns to an ambiguous occurrence the candidate which obtained the highest number of mentions in the document. None of these strategies succeeded in improving the performance of the system, and this seems to be at least partly due to the length of the documents and to the high degree of ambiguity of some entries (consider that the entry “Renzo” alone has 20 candidates in the dictionary, and there are other, more ambiguous entries).

A promising strategy for the NED task has been identified using co-occurrence frequencies (Shen et al., 2015; Hachey et al., 2013). Still based on the unambiguous occurrences, for each entry in the PER gazetteer a co-occurrence score has been computed with all the other entities, including Locations and Organizations, at corpus level. The co-occurrence has been considered with other entities in the span of 10 sentences, in terms of raw frequency. Then, given an ambiguous mention and its local context of 10 sentences, the co-occurrence score has been computed for each of its candidates, and the candidate with the highest score has been assigned to the mention. This strategy makes it possible to further disambiguate 10.6% (1764) of the occurrences, with precision and recall scores as indicated in Table 2 (“Lookup Search and Disambiguation”).

LOC and ORG entities. For entities of type Location and Organization only the search step has been implemented, not the disambiguation one. However, a cross cleaning has been performed, eliminating nested mentions belonging to different NE categories (for example the name “Leonardo Cocito” in the ORG entity “Battaglione Leonardo Cocito”). In such cases the longer string has always been chosen.

           Lookup Search
                   Recall   Precision   F1
           PER     0.716    0.980       0.827
           LOC     0.954    0.917       0.935
           ORG     0.987    0.991       0.989

           Lookup Search and Disambiguation
                   Recall   Precision   F1
           PER     0.751    0.965       0.845

Table 2: Evaluation of the presented pipeline.

5   Evaluation

The performance of the system has been evaluated against a manually annotated gold standard made of 1,000 sentences. The gold standard has been built a) preserving the relative size of each document with respect to the whole corpus size and b) randomly selecting the sentences from a short list that only contains sentences longer than 60 characters and with at least 3 capital letters (which is expected to maximize the probability of having a NE in the sentence). In the resulting gold standard, 1996 entities (belonging to the three mentioned categories) have been annotated as true positives by a single human annotator. The results of the evaluation are presented in Table 2. The co-occurrence approach discussed above makes it possible to gain coverage without losing too much in terms of precision and, even if the overall gain is small, the approach shows improvements where other approaches proved ineffective. The main source of improvement is that, being computed at corpus level, the co-occurrence approach embodies the occurrence information from all the texts, thus going beyond the document level; this proves to be effective when an entity does not appear in unambiguous form in the document at hand but does in other documents of the collection. One limit of the approach emerges when an entity never appears in unambiguous form in the whole corpus, since the grounding space is uniquely based on the set of unambiguous mentions harvested in the search step. Unfortunately this is often the case when memoirs are concerned: many of the authors are non-professional writers and do not always provide the
full name of the persons they introduce.

6   Conclusions and Future Work

In this paper we presented ongoing work aimed at performing Named Entity Disambiguation on a digitized historical corpus, along with the results of the evaluation. Further steps will be a) the refinement of the presented method by means of weighting measures on co-occurrence and possibly of feature optimization techniques, b) the application of the tested disambiguation strategy also to LOC and ORG entities, as well as the study of a cross-category disambiguation strategy, and finally c) the extension of the corpus and of the gazetteers in order to obtain a larger coverage of the domain. Furthermore, this work represents the first step towards extracting events and their participants from the presented corpus.

References

Ander Barrena, Eneko Agirre, Bernardo Cabaleiro, Anselmo Penas, and Aitor Soroa. 2014. One entity per discourse and one entity per collocation improve named-entity disambiguation. In COLING, pages 2260–2269.

Caroline Barrière. 2016. Natural Language Understanding in a Semantic Web Context. Springer.

Federico Boschetti, Andrea Cimino, Felice Dell’Orletta, Gianluca E Lebani, Lucia Passaro, Paolo Picchi, Giulia Venturi, Simonetta Montemagni, and Alessandro Lenci. 2014. Computational analysis of historical documents: An application to Italian war bulletins in World War I and II. In Proceedings of LREC 2014 workshop on Language resources and technologies for processing and linking historical documents and archives - deploying linked open data in cultural heritage (LRT4HDA 2014).

Carmen Brando, Francesca Frontini, and Jean-Gabriel Ganascia. 2016. Reden: named entity linking in digital literary editions using linked data sets. Complex Systems Informatics and Modeling Quarterly, (7):60–80.

Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic evaluation of NER systems on old newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), number EPFL-CONF-221391, pages 97–107. Bochumer Linguistische Arbeitsberichte.

Paolo Ferragina and Ugo Scaiella. 2010. TagMe: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1625–1628. ACM.

Anna Goy, Diego Magro, and Marco Rovera. 2015. Ontologies and historical archives: a way to tell new stories. Applied Ontology, 10(3-4):331–338.

Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R Curran. 2013. Evaluating entity linking with Wikipedia. Artificial Intelligence, 194:130–150.

Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, and Sara Tonelli. 2016. ALCIDE: Extracting and visualising content from large document collections to support humanities studies. Knowledge-Based Systems, 111:100–112.

Federico Nanni, Yang Zhao, Simone Paolo Ponzetto, and Laura Dietz. 2017. Enhancing domain-specific entity linking in DH. Book of Abstracts of Digital Humanities, 2:67–88.

Giuseppe Rizzo and Raphaël Troncy. 2012. NERD: a framework for unifying named entity recognition and disambiguation extraction tools. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 73–76. Association for Computational Linguistics.

Kepa Joseba Rodriquez, Mike Bryant, Tobias Blanke, and Magdalena Luszczynska. 2012. Comparison of named entity recognition tools for raw OCR text. In KONVENS, pages 410–414.

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.
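
Illustrative sketch: pattern instantiation and lookup for PER entities (Section 4.2). The following Python fragment is a minimal sketch of how a gazetteer with Name, Surname and Alias fields could be expanded into a dictionary of surface forms that keeps track of all candidate entries. It is not the authors' implementation. Only a small subset of the 15 patterns is shown, the bare «...» forms are an assumption based on the “«Renzo»” example, and the tuple layout, the function names and the entries p2 and p3 are hypothetical and exist only to reproduce that kind of ambiguity.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer rows: (entry_id, name, surname, alias).
# p1 is the example quoted in Section 4.2; p2 and p3 are invented.
GAZETTEER = [
    ("p1", "Gustavo", "Comollo", "Pietro"),
    ("p2", "Renzo", "Rossi", None),
    ("p3", "Mario", "Bianchi", "Renzo"),
]

# (template, fields that must be present): a small subset of the 15 patterns.
PATTERNS = [
    ("{name} {surname} ({alias})", ("name", "surname", "alias")),
    ("{name} «{alias}» {surname}", ("name", "surname", "alias")),
    ("{alias} {surname}", ("alias", "surname")),
    ("«{alias}»", ("alias",)),
    ("«{name}»", ("name",)),
]


def build_pattern_dictionary(gazetteer):
    """Map every instantiated surface form to the set of entries it may denote."""
    surface_to_ids = defaultdict(set)
    for entry_id, name, surname, alias in gazetteer:
        fields = {"name": name, "surname": surname, "alias": alias}
        for template, required in PATTERNS:
            # Skip patterns that need a field this entry does not have.
            if all(fields[f] for f in required):
                surface_to_ids[template.format(**fields)].add(entry_id)
    return dict(surface_to_ids)


def find_mentions(text, surface_to_ids):
    """Return (surface, candidates) pairs; longer surfaces are tried first."""
    surfaces = sorted(surface_to_ids, key=len, reverse=True)
    if not surfaces:
        return []
    regex = re.compile("|".join(re.escape(s) for s in surfaces))
    return [(m.group(), sorted(surface_to_ids[m.group()]))
            for m in regex.finditer(text)]


dictionary = build_pattern_dictionary(GAZETTEER)
mentions = find_mentions("Gustavo «Pietro» Comollo e «Renzo»", dictionary)
# Mentions with a single candidate are resolved directly; the rest need
# a separate disambiguation step.
unambiguous = [(s, c[0]) for s, c in mentions if len(c) == 1]
ambiguous = [(s, c) for s, c in mentions if len(c) > 1]
```

In this toy run the full form “Gustavo «Pietro» Comollo” resolves to a single entry, while “«Renzo»” keeps two candidates. A real lookup would also need to handle word boundaries and OCR noise, which this sketch ignores.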
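
Illustrative sketch: co-occurrence based disambiguation (Section 4.2). The fragment below sketches one possible reading of the co-occurrence strategy: raw pair counts are collected at corpus level from unambiguous mentions, and an ambiguous mention receives the candidate that co-occurs most often with the entities in its local context. Several details are assumptions, since the paper does not specify them: two mentions are taken to co-occur when their sentence indices differ by less than `span`, a candidate with score zero is left unresolved, and ties go to the first candidate. All identifiers and the toy data are hypothetical.

```python
from collections import Counter


def cooccurrence_counts(docs, span=10):
    """Count entity pairs seen within `span` sentences of each other.

    docs: list of documents; each document is a list of sentences;
    each sentence is a list of ids of unambiguous entities (PER, LOC, ORG).
    """
    counts = Counter()
    for sentences in docs:
        # Flatten to (sentence_index, entity_id), ordered by sentence index.
        flat = [(i, e) for i, sent in enumerate(sentences) for e in sent]
        for idx, (i, a) in enumerate(flat):
            for k in range(idx + 1, len(flat)):
                j, b = flat[k]
                if j - i >= span:
                    break  # later mentions are even further away
                if a != b:
                    counts[frozenset((a, b))] += 1
    return counts


def disambiguate(candidates, context_entities, counts):
    """Pick the candidate with the highest raw co-occurrence score, or None."""
    scores = {
        c: sum(counts[frozenset((c, e))] for e in context_entities if e != c)
        for c in candidates
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Toy corpus: one document with three sentences (invented data).
toy_docs = [[["p3", "loc_A"], [], ["p2", "org_B"]]]
toy_counts = cooccurrence_counts(toy_docs, span=2)
choice = disambiguate(["p2", "p3"], ["loc_A"], toy_counts)
no_choice = disambiguate(["p2", "p3"], ["loc_C"], toy_counts)
```

Because the counts are gathered over all documents, a candidate can be selected even when it never appears unambiguously in the document at hand, which matches the behaviour discussed in Section 5. Weighted scores, mentioned as future work, would replace the raw sum in `disambiguate`.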
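
Illustrative check: F1 values in Table 2. The paper does not state how F1 is computed. Assuming the usual balanced F1, that is the harmonic mean of precision and recall, the short fragment below recomputes the F1 column from the reported recall and precision values. The function name and the dictionary layout are hypothetical.

```python
def f1_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall, precision) pairs as reported in Table 2.
TABLE_2 = {
    "PER lookup": (0.716, 0.980),
    "LOC lookup": (0.954, 0.917),
    "ORG lookup": (0.987, 0.991),
    "PER lookup + disambiguation": (0.751, 0.965),
}

recomputed = {
    label: round(f1_score(precision, recall), 3)
    for label, (recall, precision) in TABLE_2.items()
}
```

Under this assumption the recomputed values agree with the F1 column of Table 2 to three decimal places, and the gain for PER after disambiguation comes from higher recall at a small cost in precision.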