=Paper=
{{Paper
|id=Vol-2006/paper020
|storemode=property
|title=Domain-specific Named Entity Disambiguation in Historical Memoirs
|pdfUrl=https://ceur-ws.org/Vol-2006/paper020.pdf
|volume=Vol-2006
|authors=Marco Rovera,Federico Nanni,Simone Paolo Ponzetto,Anna Goy
|dblpUrl=https://dblp.org/rec/conf/clic-it/RoveraNPG17
}}
==Domain-specific Named Entity Disambiguation in Historical Memoirs==
Marco Rovera (1), Federico Nanni (2), Simone Paolo Ponzetto (2), Anna Goy (1)

(1) Dipartimento di Informatica, Università di Torino, Italy, {rovera,goy}@di.unito.it

(2) Data and Web Science Group, University of Mannheim, Germany, {federico,simone}@informatik.uni-mannheim.de

===Abstract===
English. This paper presents the results of the extraction of named entities from a collection of historical memoirs about the Italian Resistance during World War II. The methodology followed for the extraction and disambiguation task is discussed, as well as its evaluation. For the semantic annotation of the dataset, we developed a pipeline based on established practices for extracting and disambiguating Named Entities. This was necessary given the poor performance of the out-of-the-box Named Entity Recognition and Disambiguation (NERD) tools tested in the initial phase of this work.

Italiano (in English translation). This article presents the named entity extraction carried out on a collection of memoirs about the period of the Italian Resistance in the Second World War. The methodology developed for extracting and disambiguating named entities is discussed, together with its evaluation. Implementing a lookup-based extraction and disambiguation methodology became necessary given the poor performance of Named Entity Recognition and Disambiguation (NERD) systems, as shown by the discussion in the first part of this work.

===1 Introduction and Motivation===
Current NLP techniques allow us to treat some types of historical textual resources provided by, among others, historical archives and libraries, as a source of information (and, in prospect, of knowledge) for automatic systems. Besides encyclopedic resources, libraries and archives provide many different types of texts, often spanning very specific geographical, individual or thematic contexts, for which current knowledge extraction systems may lack the suitable information. Nevertheless, the task of extracting, disambiguating and linking information provided by historical textual documents with respect to external knowledge bases is still a crucial step towards automatic access to written resources and towards further use of such knowledge in end-user applications (e.g. navigation, rich semantic search, creation of narrative chains). In order to address longer-term tasks, such as event extraction from historical texts (Goy et al., 2015), we first addressed the task of extracting and disambiguating Named Entities (Persons, Locations and Organizations) from a corpus of historical memoirs of the “Liberation War” in Italy, during the Second World War. Due to the specificity of the domain and of the involved entities, state-of-the-art tools for Named Entity Recognition and Disambiguation perform poorly, which led us to pursue our goal with a different approach. In this paper we present a collection of documents created by digitizing historical memoirs, together with an overview of the methodology we followed for the extraction and disambiguation of Persons, Locations and Organizations, as well as the results of the evaluation of its output in comparison with the output of two state-of-the-art systems. The paper is organized as follows: Section 2 discusses some related projects, and Section 3 presents the dataset used in the experiment. Section 4 describes the test of two automatic NER tools (4.1) and the methodology devised for our experiment (4.2). Section 5 discusses the results of the evaluation, and Section 6 concludes the paper and outlines the next developments of the project.

===2 Related Work===
The work described in this paper is mainly related to Named Entity Recognition and Disambiguation (NERD) techniques and their application in the field of Digital Humanities (DH), in particular on historical texts. While NER refers to the task of identifying named entities in text and classifying them according to a set of categories, a Named Entity Disambiguation (NED) task is aimed at assigning a correspondence between an ambiguous surface form and the individual entity it refers to. Although analytically they can be considered as two separate tasks, the current availability of large, publicly accessible knowledge bases has made it possible to merge them into the task of Entity Linking (EL), which aims at linking a surface form from a text to the corresponding entry in a resource like DBpedia or Wikipedia (Barrière, 2016). A recent application of EL techniques in a DH context is presented in Brando et al. (2016), where the authors use a graph-based approach and exploit Linked Data for linking mentions of writers in a corpus of French literary criticism and scientific essays. Discussions and experiments on the use of third-party NER services on historical OCRed texts (typewritten memoirs of Holocaust survivors and old newspapers respectively) are provided by Rodriquez et al. (2012) and by Ehrmann et al. (2016), offering a starting point for our work, since they quantify the performance of such NER tools on specific historical texts and show their limitations (as also remarked in Nanni et al. (2017)). In the Italian DH research community, too, interest in mining historical texts has become more evident in recent years, leading to several interesting works. In Boschetti et al. (2014), for example, the authors describe the ongoing work of applying a full Information Extraction pipeline (from OCR digitization to data visualization) to war bulletins of WWI and WWII and discuss the issues they addressed in adapting existing tools to dated and domain-specific language. Another related project with a similar setting is ALCIDE, described in Moretti et al. (2016), a platform that supports the use of text mining techniques for the navigation and visualization of information in historical and literary texts.

===3 Dataset===
The collection of documents used in this work is composed of 15 printed books, written in Italian, that have been digitized using standard OCR techniques, overall counting over 855,000 words (about 45,000 sentences). The documents are historical memoirs of Italian partisans from WWII. More specifically, the covered time span goes from 8 September 1943 to 25 April 1945, a period known in Italian historiography as “Resistenza” (Resistance). The geographic area encompassed by the narrated events is the south-western part of the Alps in Piemonte, Italy, with some minor exceptions. The texts have been intentionally selected for digitization because they have a partial but significant overlap in terms of narrated events, as well as of places and involved people. None of the 15 documents carries any semantic annotation. Besides the digitization of the documents, three gazetteers have been created: the first one, containing names of persons (1820 entries), has been populated using name indexes provided by 6 of the texts, while the gazetteers containing toponyms and names of organizations (1140 and 190 entries, respectively) have been built manually during the digitization activities.

The setting of our work is partly determined by some features of the textual resources under analysis, in particular: 1) due to the specificity of the domain, only 4% of the persons in the gazetteer are available in the Italian Wikipedia (according to a manual check carried out on the whole gazetteer); the same problem holds for organizations and, to a smaller extent, for toponyms; 2) while for entities of type Location (LOC) and Organization (ORG) the mining process involves the usual problems (abbreviations, upper vs lowercase mentions, ambiguity due to the same surface form), with Person (PER) entities the domain at hand presents a further issue, as it was quite common among the partisans to use aliases, or noms de guerre. This feature is shown by 32% of the occurrences in our PER gazetteer (often the most prominent ones in the narrated events). This means that in text persons are to be found under different combinations of name, surname and nickname. While in some cases this additional information makes the disambiguation process easier, in many other cases it may represent an additional source of ambiguity. The PER gazetteer is structured in three fields, namely Name, Surname and Alias, that are later combined into patterns (see Section 4.2); conversely, in the ORG and LOC gazetteers, for each entry all the possible lexical forms are listed (for
the Italian Action Party, for example, we will have: Partito d’Azione, PdA, Pd’A, P.d.A. and so on).

===4 Experiment===
====4.1 Test of existing automatic NERD tools====
In order to clarify the need for an ad hoc extraction and disambiguation approach for our texts, we first tried state-of-the-art NERD tools; we randomly selected 200 sentences from the corpus and annotated them with NERD (Rizzo and Troncy, 2012), a framework that aggregates the results from different NER systems (Alchemy API, DBpedia Spotlight, TextRazor, Zemanta among others), and TagMe (Ferragina and Scaiella, 2010), an entity linker to Wikipedia available also for Italian. Table 1 shows the fraction of correctly recognized (i.e. classified) and linked occurrences obtained by the two systems. Since TagMe does not separate the two tasks of Recognition and Linking, for this system we only report the Linking results. In the recognition task, NERD performs quite well for Persons and Locations, while its performance drops for Organizations. As we turn to the linking task, we observe that the trend in the results is similar in the two systems: performance is very low in the case of Persons, while it improves in the case of Locations and remains quite low for Organizations. This result can partly be explained by the degree of (spatial and social) specificity of the entities that are to be found in the corpus: state-of-the-art tools perform well on prominent entities (for example “Benito Mussolini”), but large-scale knowledge bases lack the suitable knowledge for specific contexts, like those that are more often to be found in the historical memoirs under analysis (and thus NERD systems are not able to link specific entities, such as Chiaffredo Barreri «Tormenta»).

{| class="wikitable"
|+ Table 1: Evaluation using TagMe and NERD (fraction of correctly recognized and linked occurrences over a sample of 200 sentences).
! Task !! System !! PER !! LOC !! ORG
|-
| Recognition || NERD || 0.66 || 0.70 || 0.51
|-
| Linking || TagMe || 0.05 || 0.45 || 0.37
|-
| Linking || NERD || 0.03 || 0.47 || 0.27
|}

====4.2 Methodology====
The mining process initially took the form of a simple string matching in text, based on the entries provided by the gazetteers. However, due to the different ways each entity type can appear in text, as discussed in Section 3, two different strategies have been implemented: string matching with some refinements for LOC and ORG entity types, and a slightly more elaborate strategy for PER entities, based on co-occurrence statistics derived directly from the corpus under study.

PER entities. Based on the manual analysis of the documents, 15 lexical patterns have been observed, through which proper names of partisans appear in text; frequently occurring patterns are for example “Name Surname (Alias)”, as in “Gustavo Comollo (Pietro)”, “Name «Alias» Surname”, as in “Gustavo «Pietro» Comollo”, or “Alias Surname”, as in “Pietro Comollo”. Each of these 15 patterns has been automatically instantiated for each entry of the gazetteer. This resulted in a dictionary of instantiated patterns that has been used directly for the string matching step in text. Since a certain degree of ambiguity (homonymy) is present in the gazetteer, where many entries share the same name or surname or alias, for each instance of the patterns in the dictionary an ambiguity value has been computed, keeping track, for the ambiguous instances, of all the possible individuals they may actually refer to. For example, the pattern instance “«Renzo»”, which in Italian can be both a name and an alias, has been connected to all the entries in the gazetteer where “Renzo” appears either as name or as alias, which become candidates for that specific occurrence. Then the string matching in text has been performed. Within the found occurrences, we separated the unambiguous occurrences (those that refer to only one entry in the gazetteer), which have been considered as true positives and did not require further processing, from the ambiguous ones, for which a disambiguation step is needed. Considering only the unambiguous mentions retrieved this way, the system scored a precision of .98 (see Section 5), so we used this set of occurrences as grounding space for the disambiguation step. At this point the system has disambiguated 55.8% (9268) of the PER occurrences in the corpus, while 44.2% (7341) remain ambiguous.

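The lookup stage for PER entities (pattern instantiation, candidate tracking, and the split between unambiguous and ambiguous mentions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not list all 15 patterns, so `PATTERNS` holds the three patterns quoted in the text plus three simpler ones assumed here; the two gazetteer entries sharing “Renzo” and the sample sentence are invented for the example; and preferring the longest matching form at each position is likewise an assumption.

```python
import re
from collections import defaultdict
from string import Formatter
from typing import NamedTuple, Optional


class Person(NamedTuple):
    id: int
    name: str
    surname: str
    alias: Optional[str]


# Illustrative subset: the full list of 15 patterns is not given in the paper.
PATTERNS = [
    "{name} {surname} ({alias})",  # quoted in the text
    "{name} «{alias}» {surname}",  # quoted in the text
    "{alias} {surname}",           # quoted in the text
    "{name} {surname}",            # assumed
    "{name}",                      # assumed
    "{alias}",                     # assumed
]

GAZETTEER = [
    Person(1, "Gustavo", "Comollo", "Pietro"),  # example from the text
    Person(2, "Renzo", "Rossi", None),          # invented entry
    Person(3, "Mario", "Bianchi", "Renzo"),     # invented entry
]


def instantiate(person):
    """Fill every pattern whose fields are all available for this entry."""
    fields = person._asdict()
    forms = set()
    for pattern in PATTERNS:
        needed = [name for _, name, _, _ in Formatter().parse(pattern) if name]
        if all(fields[name] for name in needed):
            forms.add(pattern.format(**fields))
    return forms


def build_dictionary(gazetteer):
    """Map each surface form to the ids of all entries it may refer to."""
    dictionary = defaultdict(set)
    for person in gazetteer:
        for form in instantiate(person):
            dictionary[form].add(person.id)
    return dictionary


def find_mentions(text, dictionary):
    """Return (offset, surface form, candidate ids), longest form first."""
    forms = sorted(dictionary, key=len, reverse=True)
    regex = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(f) for f in forms) + r")(?!\w)"
    )
    return [(m.start(), m.group(0), dictionary[m.group(0)])
            for m in regex.finditer(text)]


def split_mentions(mentions):
    """Separate mentions with exactly one candidate from the ambiguous ones."""
    unambiguous = [m for m in mentions if len(m[2]) == 1]
    ambiguous = [m for m in mentions if len(m[2]) > 1]
    return unambiguous, ambiguous


dictionary = build_dictionary(GAZETTEER)
text = "Gustavo «Pietro» Comollo parla con Renzo."  # invented sentence
unambiguous, ambiguous = split_mentions(find_mentions(text, dictionary))
print([m[1] for m in unambiguous])
print([(m[1], sorted(m[2])) for m in ambiguous])
```

In this sketch the ambiguity value of a surface form is simply the size of its candidate set, so the full form of Comollo resolves directly while the bare “Renzo” is passed on to the disambiguation step.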
The precision and recall of this lookup step are reported in Table 2 (“Lookup Search”). In order to disambiguate the remaining occurrences, different heuristics have been explored. Based on the literature, we tried to apply to the Named Entity Disambiguation task the “one sense per discourse” hypothesis, as done by Barrena et al. (2014). Two other heuristics have been explored, which we can informally designate as Last Mentioned and Most Mentioned. Given an ambiguous occurrence recognized in text, the former links the occurrence to the last already disambiguated corresponding candidate. Following from the example above, if we find the pattern “«Renzo»” in text, which is ambiguous and corresponds to several candidates from the gazetteer, the system links the mention to the same candidate as the immediately preceding occurrence of this mention. The Most Mentioned rule, conversely, assigns to an ambiguous occurrence the candidate which obtained the highest number of mentions in the document. None of these strategies succeeded in improving the performance of the system, and this seems to be at least partly due to the length of the documents and to the high degree of ambiguity of some entries (consider that the entry “Renzo” alone has 20 candidates in the dictionary, and there are other, more ambiguous entries).

A promising strategy for the NED task has been identified using co-occurrence frequencies (Shen et al., 2015; Hachey et al., 2013). Still based on the unambiguous occurrences, for each entry in the PER gazetteer a co-occurrence score has been computed with all the other entities, including Locations and Organizations, at corpus level. The co-occurrence has been considered with other entities in the span of 10 sentences, in terms of raw frequency. Then, given an ambiguous mention and its local context of 10 sentences, the co-occurrence score has been computed for each of its candidates, and the candidate with the highest score has been assigned to the mention. This strategy makes it possible to further disambiguate 10.6% (1764) of the occurrences, with precision and recall scores as indicated in Table 2 (“Lookup Search and Disambiguation”).

LOC and ORG entities. For entities of type Location and Organization only the search step has been implemented, not the disambiguation one. However, a cross-cleaning has been performed, eliminating nested mentions belonging to different NE categories (for example the name “Leonardo Cocito” in the ORG entity “Battaglione Leonardo Cocito”). In such cases the longer string has always been chosen.

===5 Evaluation===
The performance of the system has been evaluated against a manually annotated gold standard made of 1,000 sentences. The gold standard has been built: a) preserving the relative size of each document with respect to the whole corpus size and b) randomly selecting the sentences from a short list that only contains sentences longer than 60 characters and with at least 3 capital letters (which is expected to maximize the probability of having a NE in the sentence). In the resulting gold standard, 1996 entities (belonging to the three mentioned categories) have been annotated as true positives by a single human annotator. The results of the evaluation are presented in Table 2.

{| class="wikitable"
|+ Table 2: Evaluation of the presented pipeline.
! Step !! Entity type !! Recall !! Precision !! F1
|-
| Lookup Search || PER || 0.716 || 0.980 || 0.827
|-
| Lookup Search || LOC || 0.954 || 0.917 || 0.935
|-
| Lookup Search || ORG || 0.987 || 0.991 || 0.989
|-
| Lookup Search and Disambiguation || PER || 0.751 || 0.965 || 0.845
|}

The co-occurrence approach discussed above allows us to gain coverage without losing too much in terms of precision and, even if the overall gain is small, the approach shows improvements where other approaches proved ineffective. The main source of improvement is that, being computed at corpus level, the co-occurrence approach embodies the occurrence information from all the texts, thus going beyond the document level; this proves to be effective when an entity does not appear in unambiguous form in the document at hand but does in other documents of the collection. One limit of the approach emerges when an entity never appears in unambiguous form in the whole corpus, since the grounding space is uniquely based on the set of unambiguous mentions harvested in the search step. Unfortunately this is often the case when memoirs are concerned: many of the authors are non-professional writers and do not always provide the
full name of the persons they introduce.

===6 Conclusions and Future Works===
In this paper we presented an ongoing work aimed at performing Named Entity Disambiguation on a digitized historical corpus, along with the results of the evaluation. Further steps will be a) the refinement of the presented method by means of weighting measures on co-occurrence and possibly of feature optimization techniques, b) the application of the tested disambiguation strategy also to LOC and ORG entities, as well as the study of a cross-category disambiguation strategy, and finally c) the extension of the corpus and of the gazetteers in order to obtain a larger coverage of the domain. Furthermore, this work represents the first step for extracting events and their participants from the presented corpus.

===References===
Ander Barrena, Eneko Agirre, Bernardo Cabaleiro, Anselmo Penas, and Aitor Soroa. 2014. One entity per discourse and one entity per collocation improve named-entity disambiguation. In COLING, pages 2260–2269.

Caroline Barrière. 2016. Natural Language Understanding in a Semantic Web Context. Springer.

Federico Boschetti, Andrea Cimino, Felice Dell’Orletta, Gianluca E Lebani, Lucia Passaro, Paolo Picchi, Giulia Venturi, Simonetta Montemagni, and Alessandro Lenci. 2014. Computational analysis of historical documents: An application to italian war bulletins in world war I and II. In Proceedings of LREC 2014 workshop on Language resources and technologies for processing and linking historical documents and archives - deploying linked open data in cultural heritage (LRT4HDA 2014).

Carmen Brando, Francesca Frontini, and Jean-Gabriel Ganascia. 2016. Reden: named entity linking in digital literary editions using linked data sets. Complex Systems Informatics and Modeling Quarterly, (7):60–80.

Maud Ehrmann, Giovanni Colavizza, Yannick Rochat, and Frédéric Kaplan. 2016. Diachronic evaluation of ner systems on old newspapers. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), number EPFL-CONF-221391, pages 97–107. Bochumer Linguistische Arbeitsberichte.

Paolo Ferragina and Ugo Scaiella. 2010. Tagme: on-the-fly annotation of short text fragments (by wikipedia entities). In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1625–1628. ACM.

Anna Goy, Diego Magro, and Marco Rovera. 2015. Ontologies and historical archives: a way to tell new stories. Applied Ontology, 10(3-4):331–338.

Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R Curran. 2013. Evaluating entity linking with wikipedia. Artificial intelligence, 194:130–150.

Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, and Sara Tonelli. 2016. Alcide: Extracting and visualising content from large document collections to support humanities studies. Knowledge-Based Systems, 111:100–112.

Federico Nanni, Yang Zhao, Simone Paolo Ponzetto, and Laura Dietz. 2017. Enhancing domain-specific entity linking in DH. Book of Abstracts of Digital Humanities, 2:67–88.

Giuseppe Rizzo and Raphaël Troncy. 2012. NERD: a framework for unifying named entity recognition and disambiguation extraction tools. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 73–76. Association for Computational Linguistics.

Kepa Joseba Rodriquez, Mike Bryant, Tobias Blanke, and Magdalena Luszczynska. 2012. Comparison of named entity recognition tools for raw ocr text. In KONVENS, pages 410–414.

Wei Shen, Jianyong Wang, and Jiawei Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.
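
The co-occurrence disambiguation step described in Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that each document is a list of sentences, that each sentence is reduced to the identifiers of the entities already resolved unambiguously in it, and that the "span of 10 sentences" is centred on the sentence at hand (half on each side); the paper does not specify how the window is positioned. The entity identifiers and the tiny corpus are invented for the example, and returning `None` when no candidate has any co-occurrence evidence is also an assumption.

```python
from collections import Counter

WINDOW = 10  # span of sentences taken from the text; its centring is assumed


def context(doc, i, window=WINDOW):
    """Entity ids found in the sentences around sentence i of a document."""
    half = window // 2
    lo, hi = max(0, i - half), min(len(doc), i + half + 1)
    return [entity for sentence in doc[lo:hi] for entity in sentence]


def cooccurrence_counts(corpus):
    """Raw co-occurrence frequencies over the whole corpus (all documents)."""
    counts = Counter()
    for doc in corpus:
        for i, sentence in enumerate(doc):
            nearby = context(doc, i)
            for entity in sentence:
                for other in nearby:
                    if other != entity:
                        counts[(entity, other)] += 1
    return counts


def disambiguate(candidates, doc, i, counts):
    """Pick the candidate that co-occurs most with the local context."""
    nearby = context(doc, i)
    scores = {c: sum(counts[(c, e)] for e in nearby) for c in candidates}
    best = max(scores, key=scores.get)
    # Assumption: leave the mention unresolved when there is no evidence.
    return best if scores[best] > 0 else None


# Invented corpus: unambiguous entity ids per sentence, one list per document.
CORPUS = [
    [["PER:renzo_rossi", "LOC:Cuneo"], ["ORG:PdA"]],
    [["PER:mario_bianchi", "LOC:Torino"]],
]
counts = cooccurrence_counts(CORPUS)

# An ambiguous «Renzo» in sentence 1 of a document where neither candidate
# ever appears in unambiguous form.
new_doc = [["LOC:Torino"], []]
choice = disambiguate(["PER:renzo_rossi", "PER:mario_bianchi"], new_doc, 1, counts)
print(choice)
```

Because the counts are gathered over the whole corpus, the sketch can resolve a mention in a document where the candidate never appears unambiguously, which is the effect discussed in Section 5; it also shows the stated limit, since a candidate with no unambiguous mention anywhere never accumulates a score.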