=Paper=
{{Paper
|id=Vol-2696/paper_255
|storemode=property
|title=Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_255.pdf
|volume=Vol-2696
|authors=Maud Ehrmann,Matteo Romanello,Alex Flückiger,Simon Clematide
|dblpUrl=https://dblp.org/rec/conf/clef/EhrmannRFC20a
}}
==Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers==
Maud Ehrmann [0000-0001-9900-2193] and Matteo Romanello [0000-0002-1890-2577], Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland ({maud.ehrmann,matteo.romanello}@epfl.ch)
Alex Flückiger and Simon Clematide [0000-0003-1365-0662], University of Zurich, Zurich, Switzerland ({alex.flueckiger,simon.clematide}@uzh.ch)

Abstract. This paper presents an extended overview of the first edition of HIPE (Identifying Historical People, Places and other Entities), a pioneering shared task dedicated to the evaluation of named entity processing on historical newspapers in French, German and English. Since its introduction some twenty years ago, named entity (NE) processing has become an essential component of virtually any text mining application and has undergone major changes. Recently, two main trends characterise its developments: the adoption of deep learning architectures and the consideration of textual material originating from historical and cultural heritage collections. While the former opens up new opportunities, the latter introduces new challenges with heterogeneous, historical and noisy inputs. In this context, the objective of HIPE, run as part of the CLEF 2020 conference, is threefold: strengthening the robustness of existing approaches on non-standard inputs, enabling performance comparison of NE processing on historical texts, and, in the long run, fostering efficient semantic indexing of historical documents. Tasks, corpora, and results of 13 participating teams are presented. Compared to the condensed overview [31], this paper includes further details about data generation and statistics, additional information on participating systems, and the presentation of complementary results.

Keywords: Named entity recognition and classification · Entity linking · Historical texts · Information extraction · Digitized newspapers · Digital humanities

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece. The present shared task is organized as part of “impresso - Media Monitoring of the Past”, a project which tackles information extraction and exploration of large-scale historical newspapers (https://impresso-project.ch/).

1 Introduction

Recognition and identification of real-world entities is at the core of virtually any text mining application. As a matter of fact, referential units such as names of persons, locations and organizations underlie the semantics of texts and guide their interpretation. Since the seminal Message Understanding Conference (MUC) evaluation cycle in the 1990s [44], named entity-related tasks have undergone major evolutions, from entity recognition and classification to entity disambiguation and linking [71, 87].

Context. Recently, two main trends characterise developments in NE processing. First, at the technical level, the adoption of deep learning architectures and the usage of embedded language representations greatly reshapes the field and opens up new research directions [6, 63, 62]. Second, with respect to application domain and language spectrum, NE processing has been called upon to contribute to the field of Digital Humanities (DH), where massive digitization of historical documents is producing huge amounts of texts [105].
Thanks to large-scale digitization projects driven by cultural institutions, millions of im- ages are being acquired and, when it comes to text, their content is transcribed, either manually via dedicated interfaces, or automatically via Optical Charac- ter Recognition (OCR). Beyond this great achievement in terms of document preservation and accessibility, the next crucial step is to adapt and develop ap- propriate language technologies to search and retrieve the contents of this ‘Big Data from the Past’ [53]. In this regard, information extraction techniques, and particularly NE recognition and linking, can certainly be regarded among the first and most crucial processing steps. Motivation. Admittedly, NE processing tools are increasingly being used in the context of historical documents. Research activities in this domain target texts of different nature (e.g., museum records, state-related documents, genealogical data, historical newspapers) and different tasks (NE recognition and classifica- tion, entity linking, or both). Experiments involve different time periods, focus on different domains, and use different typologies. This great diversity demon- strates how many and varied the needs—and the challenges—are, but also makes performance comparison difficult, if not impossible. Furthermore, it appears that historical texts pose new challenges to the appli- cation of NE processing [25, 83], as they do for language technologies in general [101]. First, inputs can be extremely noisy, with errors which do not resemble tweet misspellings or speech transcription hesitations, for which adapted ap- proaches have already been devised [65, 17, 100]. Second, the language under study is mostly of earlier stage(s), which renders usual external and internal evi- dences less effective (e.g., the usage of different naming conventions and presence of historical spelling variations) [12, 11]. Further, beside historical VIPs, texts from the past contain rare entities which have undergone significant changes (esp. locations) or do no longer exist, and for which adequate linguistic resources and knowledge bases are missing [48]. Finally, archives and texts from the past are not as anglophone as in today’s information society, making multilingual resources and processing capacities even more essential [26, 72]. Overall, and as demonstrated by Vilain et al. [109], the transfer of NE tools from one domain to another is not straightforward, and the performance of NE tools initially developed for homogeneous texts of the immediate past are af- fected when applied on historical materials [104]. This echoes the proposition of Plank [85], according to whom what is considered as standard data (i.e. con- temporary news genre) is more a historical coincidence than a reality: in NLP non-canonical, heterogeneous, biased and noisy data is rather the norm than the exception. Objectives. In this context of new needs and materials emerging from the hu- manities, the HIPE shared task3 puts forward for the first time the systematic evaluation of NE recognition and linking on diachronic historical newspaper ma- terial in French, German and English. In addition to the release of a multilingual, historical NE-annotated corpus, the objective of this shared task is threefold: 1. strengthening the robustness of existing approaches on non-standard inputs; 2. enabling performance comparison of NE processing on historical texts; 3. 
fostering efficient semantic indexing of historical documents in order to sup- port scholarship on digitized cultural heritage collections. The remainder of this paper is organized as follows. Section 2 briefly presents previous work on NE processing, particularly for cultural heritage domains. Sec- tions 3 and 4 present the tasks and the material used for the evaluation. Section 5 details the evaluation metrics and the organisation of system submissions. Section 6 introduces the 13 participating systems while Section 7 presents and discusses their results. Finally, Section 8 summarizes the benefits of the task and concludes. 2 Related Work This section briefly summarizes the evolution of NE processing and the adapta- tion of main approaches to named entity recognition and classification (NERC) and entity linking (EL) to the cultural heritage domain. NE processing overview. Since the seminal Message Understanding Con- ference series where NE recognition and classification was defined for the first time [45], numerous research work and evaluation campaigns subsequently devel- oped. They reflect the complexification and diversification of NE-related tasks, as well as the evolution of information extraction from a document-oriented to a more entity-centric perspective. First, NER setting itself evolved, with the ex- tension of typologies [97, 38], the enlargement of the scope of linguistic units to take into account (i.e. not only proper names) [23, 1, 2], and the consideration of 3 https://impresso.github.io/CLEF-HIPE-2020/ languages other than English, with e.g. CoNLL, ESTER, HAREM, Evalita and Germeval [106, 39, 94, 66, 9]. Next, tasks diversified, with the introduction of relation extraction, metonymy resolution [68] and entity coreference [7], later on framed as entity linking with the emergence of large-scale knowledge bases [87]. Finally, besides the general domain of well-written newswire data, named entity processing was also applied to specific domains, particularly bio-medical [55, 42], and on more noisy inputs such as speech transcriptions [37], tweets in various languages [89, 84, 8], and historical texts. Regarding the latter, research work multiplied significantly during the last decade, and besides a modest evaluation campaign on French historical texts [36], no wide-ranging and systematic evalu- ation was organized. To the best of our knowledge, the CLEF HIPE 2020 shared task is the first to address NE processing for multilingual, diachronic and his- torical material. NERC. Approaches to NERC over historical documents have grown with the evolution of techniques, from symbolic systems to traditional machine learning and, more recently, deep neural network-based approaches. Early approaches include the crafting of rule-based systems based of finite-state grammars and gazetteers, applied on e.g. American and Finish newspapers [52, 54], Swedish literary classics [12], or British parliamentary proceedings [46]. Main reported difficulties relate to OCR noise, often tackled via normalization rules based on string similarity. Then, following the (relative) greater availability of raw and annotated historical texts, research moved away in favor of machine learning approaches. Experiments first consisted in applying existing models (the most widely-spread being the Conditional Random Field-based Stanford NER sys- tem [34]) in order to assess their potential and compare their performances on OCRized vs. corrected texts [90], or on a diachronic basis [25]. 
Thereafter, work focused on training new (CRF) models on custom material, such as Aus- tralian or European newspapers [56, 73], or medieval charters [3]. Again, most work report difficulties with bad OCR, and strategies to cope with it include pre-processing (i.e. better sentence segmentation or word tokenization, OCR post-correction) or string normalisation. Finally, in line with the development of deep neural network approaches in NLP [19], new performances were attained for NERC on well-known contemporary data sets, first with CNN-BiLSTM [18], then with Bi-LSTM-CRF networks [63]4 . The later was widely adopted in the processing of a variety historical texts in e.g. English [102], French [27], Ger- man [88, 4] and Czech [50]. Neural approaches evolved further with the intro- duction of models able to better learn context, namely contextualized string embeddings [6] and Bidirectional Encoder Representations from Transformers (BERT) [22]. Recently applied on historical texts, such models—when trained on in-domain material—proved their capacity to better deal with OCR and to improve performances [95, 62, 27]. 4 CNN: Convolutional neural networks; Bi-LSTM: Bi-directional Long Short Term Memory. Entity Linking. Appeared most recently, the task of linking entity mentions to their corresponding referents in a KB has received much attention since the pioneering experiments of [15] and [20] with English Wikipedia. Given its many applications in e.g. information retrieval, content analysis, data integration and knowledge base population, numerous works on EL were published during the last decade, and we refer the reader to [87] and [99] for an overview and analysis of main approaches to EL up to the neural wave, and to [98] for an overview of neural approaches. Main challenges in EL are name variation (several surface forms for one entity), name ambiguity (one surface form for several entities), and absence of the entity in the KB (NIL). Up to the apparition of neural approaches, EL methods traditionally belonged to two families: text similarity- based approaches (computing the similarity between the context of the mention to link and the entity candidate description in the KB), and graph-based ap- proaches (computing the closeness between the mention and the candidate in a graph representing information on these objects). In both cases, main EL steps include: mention detection, candidate selection, and candidate ranking. In the digital humanities context, first experiments made use of existing, ‘off-the-shelf’ EL systems, such as [48] working on Dutch museum documents, or [93] on Italian WWII partisans’ memoirs. [35] built their own graph-based system (reden) for the disambiguation of authors’ mentions in a corpus of French literary criticism, and demonstrated the gain of complementing a generic KB (here DBpedia) with a domain-specific one (authority files from the French national library). Recent neural approaches in ‘mainstream’ (i.e. non-historical) EL outperformed state- of-the-art results, highlighting the role and importance of contextual word and entity embeddings, together with neural similarity function [58]. To date, and to the best of our knowledge, neural-based approaches were not yet applied on historical texts and, if much remains to be done, the HIPE shared task started paving the way in this direction. Overall, experiments with NERC and EL on historical material were so far carried out on different types of documents, following diverse guidelines, and evaluated in isolation. 
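As an aside, the candidate-ranking step of the text-similarity family of EL approaches described above can be made concrete with a minimal Python sketch: a mention's textual context is compared with short knowledge-base descriptions of candidate entities, and candidates are ranked by cosine similarity. The toy knowledge base, the TF-IDF scoring and the example context are purely illustrative assumptions, not a description of any system discussed in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy knowledge base: QID -> short description (illustrative only).
KB = {
    "Q1741": "Vienna, capital city of Austria",
    "Q586": "Bonn, city in North Rhine-Westphalia, Germany",
    "Q64": "Berlin, capital and largest city of Germany",
}

def rank_candidates(mention_context: str, candidate_qids: list[str]) -> list[tuple[str, float]]:
    """Score each candidate by cosine similarity between the mention's
    context and the candidate's KB description (text-similarity EL)."""
    texts = [mention_context] + [KB[q] for q in candidate_qids]
    tfidf = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()
    return sorted(zip(candidate_qids, sims), key=lambda x: x[1], reverse=True)

context = "The congress took place in the Austrian capital in 1873."
print(rank_candidates(context, ["Q1741", "Q586", "Q64"]))
```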
The HIPE shared task allows, for the first time on such material, to compare performances and approaches in a systematic way.

3 Task Description

The HIPE shared task includes two NE processing tasks, with sub-tasks of increasing level of difficulty.

Task 1: Named Entity Recognition and Classification (NERC)
– Subtask 1.1 - NERC coarse-grained (NERC-Coarse): this task includes the recognition and classification of entity mentions according to high-level entity types.
– Subtask 1.2 - NERC fine-grained (NERC-Fine): this task includes the recognition and classification of mentions according to finer-grained entity types (see Table 1), as well as of nested entities and entity mention components (e.g. function, title, name).

Table 1: Entity types used for NERC tasks.
– pers: pers.ind, pers.ind.articleauthor, pers.coll
– org: org.ent, org.ent.pressagency, org.adm
– prod: prod.media, prod.doctr
– date: time.date.abs
– loc: loc.adm, loc.adm.town, loc.adm.reg, loc.adm.nat, loc.adm.sup, loc.phys, loc.geo, loc.hydro, loc.astro, loc.oro, loc.fac, loc.add, loc.add.phys, loc.add.elec

Task 2: Named Entity Linking (EL). This task requires the linking of named entity mentions to a unique referent in a knowledge base – here Wikidata – or to a NIL node if the mention's referent is not present in the base. The entity linking task applies to non-nested mentions only and includes two settings: without and with prior knowledge of mention types and boundaries, referred to as end-to-end EL and EL only respectively.

4 Data

4.1 Corpus

The shared task corpus is composed of digitized and OCRized articles originating from Swiss, Luxembourgish and American historical newspaper collections, selected on a diachronic basis.5

5 From the Swiss National Library, the Luxembourgish National Library, and the Library of Congress (Chronicling America project), respectively. The original collections correspond to 4 Swiss and Luxembourgish titles, and a dozen for English. More details on the original sources can be found in [28].

Corpus selection. The corpus was compiled based on systematic and purposive sampling. For each newspaper and language, articles were randomly sampled among articles that a) belong to the first years of a set of predefined decades covering the life-span of the newspaper (the longest duration spans ca. 200 years), and b) have a title, have more than 50 characters, and belong to any page. For each decade, the set of selected articles was additionally manually triaged in order to keep journalistic content only. Items corresponding to feuilleton, tabular data, crosswords, weather forecasts, time schedules and obituaries, as well as those with contents that a human could not even read because of extreme OCR noise, were therefore removed. Different OCR versions of the same texts are not provided, and the OCR quality of the corpus therefore corresponds to a real-life setting, with variations according to digitization time and the preservation state of the original documents. Figure 1 hereafter shows an example of a newspaper page facsimile, a selected article thereof, and its corresponding OCR. The corpus features an overall time span of ca. 200 years, from 1798 to 2018.

Named entity tagset and guidelines. The corpus was manually annotated according to the HIPE annotation guidelines [30]. Those guidelines were derived from the Quaero annotation guide, originally designed for the annotation of named entities in French speech transcriptions and already used on historical press corpora [92, 91].
HIPE slightly recast and simplified this guide, con- sidering only a subset of entity types and components, as well as of linguistic units eligible as named entities. HIPE guidelines were iteratively consolidated via the annotation of a ‘mini-reference’ corpus—consisting of 10 content items per language—where annotation decisions were tested and difficult cases dis- cussed6 . Despite these adaptations, the HIPE corpus mostly remains compatible with Quaero-annotated data, as well as with the NewsEye project’s NE data sets7 , annotated with guidelines derived from HIPE. Table 1 presents the entity types and sub-types used for annotation, which participant systems had to recognize for NERC-Coarse (types) and NERC-Fine (most fine-grained sub-types). Named entity components, annotated for the type Person only, correspond to name, title, function, qualifier and demonym. Nested entities were annotated for Person, Organization and Location (a depth of 1 was considered during the evaluation), as well as metonymic senses, producing double tags for those entities referring to something intimately as- sociated (metonymic sense) to the concept usually associated with their name (literal sense). As per entity linking, links correspond to Wikidata QIDs8 . 6 The mini-reference corpus was released during the initial phase of the shared task as sample data and is available at https://github.com/impresso/CLEF-HIPE-2020/ tree/master/data/sample-v1.0. 7 https://www.newseye.eu/ 8 The November 2019 dump used for annotation is available at https://files.ifi. uzh.ch/cl/impresso/clef-hipe. Fig. 1: Illustration of HIPE data, from scanned image to NE annotation and CoNLL-U-liked released material. Example taken from the Gazette de Lausanne 1908.07.01, page 2. Annotation framework. We used INCEpTION, a web-based platform for text annotation and knowledge management [57]. Segment (d) of Figure 1 offers two screenshots of the annotation platform, with the annotation of a person mention and its function (left), and of location mentions, with their metonymic senses (right). The functionalities of INCEpTION that proved particularly useful dur- ing the annotation campaign were: a) the support for querying against very large knowledge bases (e.g. Wikidata) with regards to EL annotation; b) the possi- bility of enabling the use of recommenders, which can considerably speed up the annotation process (e.g. when the very same named entity occurs multiple times within the same document); and c) the provision of an Abstract Pro- gramming Interface (API) that allows for automating certain operations, such as the bulk import/export of annotated documents. Moreover, since the shared task was one of the official use cases of the INCEpTION project,9 some of the annotation platform’s features were developed to accommodate specific needs of the HIPE annotation campaign, most notably the ability of displaying im- age segments alongside OCR transcriptions. Nevertheless, some aspects of our annotation process did not perfectly fit the generic workflows implemented in INCEpTION. First, the annotation by multiple annotators of different layers (i.e. mentions and entity links) within the same document and, second, the validation of annotated data so as to verify, for example, that every annotated mention has either a Wikidata link or the NIL flag. Both limitations were overcome by means of scripts based on the API. 
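The validation scripts mentioned above are not reproduced here. Purely as an illustration of the kind of check they perform, the following sketch verifies, over a hypothetical tab-separated export of the annotated documents, that every annotated mention carries either a Wikidata QID or an explicit NIL flag. The column names and export layout are assumptions for this example; the sketch does not call the INCEpTION API and is not the organizers' actual script.

```python
import csv

def check_links(tsv_path: str, mention_col: str = "NE-COARSE-LIT",
                link_col: str = "NEL-LIT") -> list[int]:
    """Return data-line numbers of annotated mentions that have neither a
    Wikidata QID nor a NIL flag in a (hypothetical) HIPE-style export."""
    problems = []
    with open(tsv_path, encoding="utf-8") as f:
        reader = csv.DictReader(
            (line for line in f if line.strip() and not line.startswith("#")),
            delimiter="\t")
        for i, row in enumerate(reader, start=1):
            tag = (row.get(mention_col) or "O").strip()
            link = (row.get(link_col) or "").strip()
            if tag.startswith("B-") and not (link.startswith("Q") or link == "NIL"):
                problems.append(i)
    return problems

if __name__ == "__main__":
    for line_no in check_links("annotated_export.tsv"):  # placeholder file name
        print(f"mention at data line {line_no} lacks a QID or NIL link")
```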
Overall, INCEpTION proved to be a mature, stable and highly configurable annotation platform, able to support the complex workflows required by a col- laborative annotation campaign such as the one undertaken for HIPE, as well as to deal with the specifities of historical newspaper data. Annotation difficulties. The annotation campaign was carried out by the task organizers with the contribution of trilingual collaborators. Before starting an- notating, each annotator was first trained on the mini-reference corpus in order to ensure a good understanding of the guidelines. This workflow proved to be valuable in resolving instruction’s imprecisions and annotator’s doubts, however some unclear points persisted and new difficulties appeared throughout the an- notation campaign. As per NERC, major complications included, among others: a) the determination of entity boundaries in case of long functions or titles in apposition (e.g. M. Curtoys d’Anduaga, doyen du corps diplojtëlfsue espagnol, et ministre plénipotentiaire pendant 50 ans)10 ; b) the determination of what is to be considered (or not) as an Organization: despite clear specification, the definition of this class is not clear-cut and there are always groupings of some sort which prompt an interpretation as Organization, while they are not11 (e.g. Commis- sion impériale, les gouvernements de l’Entente, Die französische Regierung); c) the qualification of a location name as being of a region (loc.adm.reg) or of a nation (loc.adm.nat), particularly in a historical context (e.g. Savoie, Mol- davia); d) the entanglement of entities, some of which have to be identified as nested or as components (e.g. the mix of Person, Function and Organisation in Chez Manguet et Cherbuliez imprimaires-libraires à Genève); e) the harmoniza- 9 https://inception-project.github.io/use-cases/impresso/ 10 In these cases, we found that the annotation of components was greatly supporting the definition of entities’ scope. 11 According to our guidelines. tion of rules across languages, e.g. with German compounds (e.g. Zürichputsch, Baslerpropaganda); f) the attempt to avoid country-related biases, such as the importance and role of canton councils in Switzerland vs. in other countries; and f) the annotation of metonymy, whose interpretation is rather subjective and may differ between annotators. We had no difficulties related to unread- able OCR since extremely noisy articles were filtered out beforehand, and since annotators could see original facsimiles while annotating. With respect to Entity Linking, difficulties naturally related to the historical nature of the material. If it is highly preferable to have some historical back- ground knowledge related to the collection—which most HIPE annotators had—, it appears that this is not in itself any guarantee of a swift resolution of mention referents. As a matter of fact, most person mentions in newspapers correspond to people who enjoyed a certain popularity at a specific time, but who are now medium- or little-known, except for experts in this or that Spanish dynasty, Swiss mountain tunnel, or local football club. As a result, historical background knowledge was mainly helpful for cases involving VIPs (e.g. Wilhelm II, Jean Jaures), and the linking of person mentions often proved to be comparable to detective work where one has to first understand who could be the person (by cross-referencing clues), before finding its ID in Wikidata (which, to our surprise, existed quite often). 
As a lesson learned, curiosity, persistence and investigation skills are as important as historical knowledge. Besides the mere identification of who’s who, another difficulty was the choice of the relevant Wikidata ID for ‘changing’ entities, often locations, whose geographical and/or administrative realities evolved through time. Here the main issue turned out to be the unequal ‘tracking’ in the KB of the various historical statuses an entity could have had: while some countries might have an entry for each of their geopolitical phase (e.g. all French political regimes or German-related states throughout 19 and 20C), others have only a generic entry. This posed the problem of the coherency of annotation granularity, which, despite consistency checks, is not fully guaran- teed in our data set, since none of the annotation—specific or generic—is entirely wrong, and annotators did take different decisions in the heat of annotation. The fuzzy setting in EL evaluation mitigates this aspect (see Section 5). Overall, besides being time consuming, the annotation of multilingual his- torical texts proved to be rather challenging compared to our experience on contemporary data. As future improvements we took note of detailing further some points of the guidelines, specifically with respect to metonymy annotation and to entity linking. Annotators’ agreement. The inter-annotator agreement rates between two annotators were computed on a selection of documents (test set) using Krip- pendorf’s α [59], as provided by INCEpTION version 0.15.2. Scores correspond to, for Fr, De and En respectively: .81, .79 and .80 for NERC, .73, .69 and .78 for linking towards a QID, and .95, .94 and .90 for linking towards NIL. NERC and linking towards NIL show a good agreement between annotators. The lower scores on entity linking confirm the difficulty of the task, especially in the con- Lg. Docs Tokens Mentions Literal Metonymic Nested Comp. M %noisy %NIL M %NIL Train fr 158 166218 7376 6925 9.27 28.16 451 3.33 473 3051 de 104 86961 3832 3506 8.56 18.40 326 1.23 159 1441 en - - - - - - - - - - Total 262 253179 11208 10431 9.03 24.88 777 2.45 632 4492 Dev fr 43 37953 1835 1727 10.65 22.52 108 0.00 91 724 de 40 36176 1487 1390 18.92 23.88 97 3.09 75 563 en 80 29060 981 966 1.35 45.86 15 0.00 - - Total 163 103189 4303 4083 11.27 28.51 220 1.36 166 1287 Test fr 43 40855 1712 1600 13.13 22.81 112 1.79 82 709 de 49 30738 1265 1147 15.34 20.92 118 0.00 73 431 en 46 16635 474 449 7.13 42.54 25 0.00 - - Total 138 88228 3451 3196 13.08 24.91 255 0.78 155 1140 All fr 244 245026 10923 10252 33.05 73.50 671 5.11 646 4484 de 193 153875 6584 6043 42.82 63.21 541 4.32 307 2435 en 126 45695 1455 1415 8.47 88.40 40 0.00 - - Total 563 444596 18962 17710 10.28 25.72 1252 1.92 953 6919 Table 2: Overview of corpus statistics (v1.3). M stands for number of mentions, %noisy stands for the percentage of mentions with at least one OCR error, and %NIL stands for the percentage of mentions linked to NIL. text of historical documents. The low score observed on German (.69) is due to annotation discrepancies with respect to the linking of metonymic entities. Corpus characteristics. For each task and language—with the exception of English—the HIPE corpus was divided into training, dev and test data sets (70/15/15). English was included later in the shared task and only dev and test sets were released for this language. 
The overall corpus consists of 563 annotated documents, for a total of 444,596 tokens and 18,962 (linked) mentions (see Table 2 for detailed overview statistics12). With 10,923 and 6,584 mentions, the French and German corpora are larger than the English one (1,455). Despite our efforts to devise a balanced sampling strategy, the diachronic distribution of mentions is not entirely uniform across languages (see Fig. 2). This is mainly due to the following factors: the temporal boundaries of the data to sample from (the German corpus stops at 1950, and the English one shortly afterwards); the varying content of newspaper articles; and, finally, the difficulty of sampling enough material for certain decades because of OCR noise, as is the case with the years 1850-1879 in the English corpus.

12 These statistics are slightly different from those presented in [31] but, after thorough double-checking, are to be considered the reference ones.

An important aspect of the HIPE corpus, and of historical newspaper data in general, is the noise generated by OCR. Annotators were asked to transcribe the surface forms of noisy mentions so as to enable studying the impact of noisy mentions on NERC and EL tasks. In the test set—where we manually verified the consistency of annotators' transcriptions—about 10% of all mentions contain OCR mistakes.

Together with OCR, the limited coverage of knowledge bases such as Wikidata tends to have an impact on historical NE processing, and especially on linking. In our corpus, 25.72% of all literal mentions could not be linked to a Wikidata entry (NIL entities). Interestingly, and contrary to our initial assumption, NIL entities are uniformly distributed across time periods (see Fig. 3). The NIL ratio is higher for Person, Media and Organisation entities, whereas for geographic places (Location) Wikidata shows substantial coverage (see Table 3). Date mentions were not linked, as per the HIPE annotation guidelines.

Fig. 2: Diachronic distribution of mentions across languages. Fig. 3: Diachronic ratio of NIL entities.

Corpus release. Data sets were released in IOB format with hierarchical information, in a similar fashion to CoNLL-U13, and consist of UTF-8, tab-separated-values files containing the necessary information for all tasks (NERC-Coarse, NERC-Fine, and EL). Given the noisy quality of the material at hand, we chose to apply neither sentence splitting nor sophisticated tokenization but, instead, to provide all the information necessary to rebuild the OCR text. The tokenization applied to produce the IOB files is based on simple white space splitting, leaving all punctuation signs (including apostrophes) as separate tokens.14 Participants could choose to apply their own sentence splitting and tokenization. Alongside each article, metadata (journal, date, title, page number, image region coordinates) and IIIF links

13 https://universaldependencies.org/format.html
14 The flag ‘NoSpaceAfter’ provides information about how to reconstruct the text.

Type Lg. Literal Metonymic Lit.+Meto.
Nested M L %NIL M L %NIL M L %NIL Location fr 4716 4487 4.86 9 7 22.22 4725 4494 4.89 523 de 3006 2883 4.09 30 30 0.00 3036 2913 4.05 209 en 565 487 13.81 9 9 0.00 574 496 13.59 - Total 8287 7857 5.19 48 46 4.17 8335 7903 5.18 732 Person fr 3704 2051 44.63 0 0 0.00 3704 2051 44.63 33 de 1910 1332 30.26 3 3 0.00 1913 1335 30.21 29 en 558 154 72.40 0 0 0.00 558 154 72.40 - Total 6172 3537 42.69 3 3 0.00 6175 3540 42.67 62 Organization fr 1124 779 30.69 661 652 1.36 1785 1431 19.83 87 de 660 458 30.61 507 505 0.39 1167 963 17.48 61 en 194 120 38.14 31 31 0.00 225 151 32.89 - Total 1978 1357 31.40 1199 1188 0.92 3177 2545 19.89 148 Date fr 398 398 0.00 1 1 0.00 399 399 0.00 0 de 240 240 0.00 0 0 0.00 240 240 0.00 6 en 46 46 0.00 0 0 0.00 46 46 0.00 - Total 684 684 0.00 1 1 0.00 685 685 0.00 6 Media fr 310 231 25.48 0 0 0.00 310 231 25.48 3 de 227 153 32.60 1 1 0.00 228 154 32.46 2 en 52 20 61.54 0 0 0.00 52 20 61.54 - Total 589 404 31.41 1 1 0.00 590 405 31.36 5 Grand Total 17710 13839 21.86 1252 1239 1.04 18962 15078 20.48 953 Table 3: Statistics per coarse entity type over all data sets, divided per language and per reading (literal and metonymic) and annotation depth (nested) types. M stands for ‘mentions’ (i.e. number of mentions), L stands for ‘linked men- tions’ (i.e. number of mentions linked to Wikidata), and %NIL stands for the percentage of mentions linked to NIL. to original page images are additionally provided when available [29]. Segment (e) of Figure 1 corresponds to an excerpt of the IOB HIPE data. The HIPE corpus, comprising several versions of each data set for the 3 lan- guages, is released under a CC BY-NC 4.0 license15 and is available on Zenodo16 as well as on the HIPE GitHub repository17 . 4.2 Auxiliary Resources In order to support participants in their system design and experiments, we provided auxiliary resources in the form of ‘in-domain’ word and character-level 15 https://creativecommons.org/licenses/by-nc/4.0/legalcode 16 https://zenodo.org/deposit/3706857 17 https://github.com/impresso/CLEF-HIPE-2020/tree/master/data embeddings acquired from the same impresso newspapers titles and time pe- riods from which HIPE training and development sets were extracted. Those embeddings correspond to fastText word embeddings [10] and flair contextual- ized string embeddings [5], both for French, German and English. More specifically, fastText embeddings came in two versions, with subword 3- 6 character n-grams and without, and were computed after a basic pre-processing (i.e., lower-casing, replacement of digits by 0 and deletion of all tokens and punc- tuation signs of length 1) that also tried to imitate the tokenization of the shared task data. Flair character embeddings were computed using flair 0.4.518 with a context of 250 characters, a batch size of 400-600 (depending on the GPU’s mem- ory), 1 hidden layer (size 2048), and a dropout of 0.1. Input was normalized with lower-casing, replacement of digits by 0, and of newlines by spaces; everything else was kept as in the original text (e.g. tokens of length 1). It is to be noted that the amount of training material greatly differed between languages (20G for French and 8.5G for German taken from Swiss and Luxembourgish newspapers; 1.1G for English taken from Chronicling America material). These embeddings are released under a CC BY-SA 4.0 license19 and are avail- able for download.20 Contextualized character embeddings were also integrated into the flair framework21 . 
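Before turning to the evaluation framework, a short sketch of how the release format of Section 4.1 can be read back: tokens are joined with spaces unless the ‘NoSpaceAfter’ flag is present, which allows the original OCR text to be reconstructed. The column names (TOKEN, MISC) and the example file name are assumptions based on the participation guidelines and may differ from the released files.

```python
import csv

def rebuild_text(tsv_path: str) -> str:
    """Reconstruct the OCR text from a HIPE-style IOB/TSV file by joining
    tokens with spaces unless the MISC column carries a NoSpaceAfter flag.
    Comment lines (starting with '#') hold document metadata and are skipped."""
    pieces = []
    with open(tsv_path, encoding="utf-8") as f:
        reader = csv.DictReader(
            (line for line in f if line.strip() and not line.startswith("#")),
            delimiter="\t")
        for row in reader:
            token = row["TOKEN"]                      # assumed column name
            misc = row.get("MISC", "") or ""          # assumed column name
            pieces.append(token)
            if "NoSpaceAfter" not in misc:
                pieces.append(" ")
    return "".join(pieces).strip()

print(rebuild_text("HIPE-data-v1.3-dev-fr.tsv"))      # hypothetical file name
```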
5 Evaluation Framework 5.1 Evaluation Measures NERC and EL tasks are evaluated in terms of Precision, Recall and F-measure (F1) [67]. Evaluation is done at entity level according to two metrics: micro average, with the consideration of all TP, FP, and FN22 over all documents, and macro average, with the average of document’s micro figures. Our definition of macro differs from the usual one: averaging is done at document-level and not across entity-types, and allows to account for (historical) variance in doc- ument length and entity distribution within documents instead of overall class imbalances. Both NERC and EL benefit from strict and fuzzy evaluation regimes. For NERC (Coarse and Fine), the strict regime corresponds to exact boundary matching and the fuzzy to overlapping boundaries. It is to be noted that in the strict regime, predicting wrong boundaries leads to a ‘double’ punishment of one false negative (entity present in the gold standard but not predicted by the system) and one false positive (entity predicted by the system but not present in the gold standard). Although it punishes harshly, we keep this metric to be in line with CoNLL and refer to the fuzzy regime when boundaries are of less importance. 18 https://github.com/flairNLP/flair 19 https://creativecommons.org/licenses/by-sa/4.0/legalcode 20 https://files.ifi.uzh.ch/cl/siclemat/impresso/clef-hipe-2020/flair/ 21 https://github.com/flairNLP/flair 22 True positive, False positive, False negative. The definition of strict and fuzzy regimes differs for entity linking. In terms of boundaries, EL is always evaluated according to overlapping boundaries in both regimes (what is of interest is the capacity to provide the correct link rather than the correct boundaries). EL strict regime considers only the system’s top link prediction (NIL or QID), while the fuzzy regime expands system predic- tions with a set of historically related entity QIDs. For example, “Germany” QID is complemented with the QID of the more specific “Confederation of the Rhine” entity and both are considered as valid answers. The resource allowing for such historical normalization was compiled by the task organizers for the entities of the test data sets, and is released as part of the HIPE scorer. For this regime, participants were invited to submit more than one link, and F-measure is additionally computed with cut-offs @3 and @5. The HIPE scorer23 was provided to the participants early on and is published under the MIT license. After the evaluation phase, a complete HIPE evaluation toolkit was also released, including the data used for evaluation (HIPE corpus v1.3), the anonymized system runs submitted by participating teams, and all the recipes and resources (e.g. historical mappings) required to replicate the present evaluation24 . 5.2 Task Bundles In order to allow the greatest flexibility to participating teams as to which tasks to compete for while keeping a manageable evaluation frame, we introduced a system of task bundles offering different task combinations (see Table 4). Teams were allowed to choose only one bundle per language and to submit up to 3 runs per language. Only Bundle 5 (EL only) could be selected in addition to another one; this exception was motivated by the intrinsic difference between end-to-end linking and linking of already extracted entity mentions. Detailed information on system submission can be found in the HIPE Participation Guidelines [29]. 
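Returning to the measures of Section 5.1, the following minimal sketch illustrates micro averaging (pooling TP/FP/FN over all documents), document-level macro averaging (averaging per-document F1), and the strict (exact boundaries) versus fuzzy (overlapping boundaries) matching regimes. It is a simplified illustration, not the official HIPE scorer; in particular, it ignores multiple predictions matching the same gold mention.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token, end token, entity type)

def match(gold: Span, pred: Span, fuzzy: bool) -> bool:
    """Strict: same type and exact boundaries; fuzzy: same type and overlap."""
    if gold[2] != pred[2]:
        return False
    if fuzzy:
        return gold[0] < pred[1] and pred[0] < gold[1]
    return gold[:2] == pred[:2]

def doc_counts(gold: List[Span], pred: List[Span], fuzzy: bool):
    """Per-document TP, FP, FN. Note the 'double punishment' in strict mode:
    a prediction with wrong boundaries yields one FP and one FN."""
    tp = sum(any(match(g, p, fuzzy) for g in gold) for p in pred)
    return tp, len(pred) - tp, len(gold) - tp

def f1(tp: int, fp: int, fn: int) -> float:
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_and_macro(docs, fuzzy=False):
    """docs: list of (gold spans, predicted spans) pairs, one per document.
    Micro pools counts over documents; macro averages per-document F1."""
    pooled, per_doc = [0, 0, 0], []
    for gold, pred in docs:
        tp, fp, fn = doc_counts(gold, pred, fuzzy)
        pooled = [a + b for a, b in zip(pooled, (tp, fp, fn))]
        per_doc.append(f1(tp, fp, fn))
    return f1(*pooled), sum(per_doc) / len(per_doc)

docs = [([(0, 2, "pers"), (5, 6, "loc")], [(0, 2, "pers"), (5, 7, "loc")]),
        ([(3, 4, "org")], [])]
print(micro_and_macro(docs, fuzzy=False))  # strict regime
print(micro_and_macro(docs, fuzzy=True))   # fuzzy (overlapping) boundaries
```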
Bundle Tasks # teams # runs 1 NERC coarse, NERC fine and EL 2 10 2 NERC coarse and EL 3 10 3 NERC coarse and NERC fine 1 8 4 NERC coarse 7 27 5 EL only 5 20 Table 4: Task bundles. 23 https://github.com/impresso/CLEF-HIPE-2020-scorer 24 https://github.com/impresso/CLEF-HIPE-2020-eval 6 System Descriptions In this first HIPE edition, 13 participating teams submitted a total of 75 sys- tem runs. All teams participated to NERC-Coarse, 3 to NERC-Fine, and 5 to end-to-end EL and EL only. The distribution of runs per language reflects the data, with 35 runs for French (42%), 26 for German (31%), and 22 for English (26%). Besides, six teams worked on all 3 languages. For NERC, all but 2 teams applied neural approaches, and most of them also worked with contextualized embeddings, in particular with BERT embeddings [108]. 6.1 Baselines As a baseline for NERC-Coarse, we trained a traditional CRF sequence classifier [77] using basic spelling features such as a token’s character prefix and suffix, the casing of the initial character, and whether it is a digit. The model, released to participating teams as part of the HIPE scorer, dismisses the segmentation structure and treats any document as a single, long sentence. No baseline is provided for the NERC-Fine sub-task. The baseline for entity linking (end-to-end EL and EL only) corresponds to AIDA-light [74], which implements the collective mapping algorithm by [47]. The wikimapper25 tool was used to map Wikipedia URLs onto Wikidata QIDs, and the end-to-end EL baseline run relied on the CRF-based NERC baseline. Given the multilingual nature of the HIPE shared task, it is worth noting that AIDA-light was trained on a 2014 dump of the English Wikipedia, therefore accounting for a generous baseline. 6.2 Participating Systems The following system descriptions are compiled from information provided by the participants. More accurate implementation details for most of the systems are available in the participants’ papers [16]. As preliminary remarks, it can be noted that for NERC many teams experimented with different input embeddings, often testing character, sub-word and word-level representations trained on con- temporary or historical material, and often combining classical type-level word embeddings (fastText) with contextualized embeddings (BERT, Flair, ELMo). Several teams also tried to improve the (newspaper) line-based input format by reconstructing linguistically motivated sentences and uniting hyphenated words. This preprocessing step turned out to be helpful. Cisteria, a collaboration of the Ludwig-Maximilians Universität and the Bayerische Staatsbibliothek München from Germany, focused on NERC-coarse for German [96]. They experimented with external and HIPE character and word embeddings as well as several transformer-based BERT-style language models (e.g., German Europeana BERT26 ), all integrated by the neural flair NER tag- ging framework [5]. They used a state-of-the-art bidirectional LSTM with a 25 https://github.com/jcklie/wikimapper 26 https://huggingface.co/dbmdz Conditional Random Field (CRF) layer as proposed by [49]. As a result of their experiments with a variety of pre-trained monolingual and multilingual word representations, they finally used different embeddings for literal and metonymic NERC models. No additional NER training material was used. Ehrmama, affiliated with the University of Amsterdam, tackled coarse and fine-grained NERC for all languages [107]. 
They build on the bidirectional LSTM- CRF architecture of [63] and introduce a multi-task approach by splitting the top layers for each entity type. Their general embedding layer combines a mul- titude of embeddings, on the level of characters, sub-words and words; some newly trained by the team, as well as pre-trained BERT and HIPE’s in-domain fastText embeddings. They also vary the segmentation of the input: line segmen- tation, document segmentation as well as sub-document segmentation for long documents. Their results suggest that splitting the top layers for each entity type is not beneficial. However, the addition of various embeddings improves the per- formance. Using (sub-)document segmentation clearly improved results when compared to the line segmentation found in newspapers. No additional NER training material was used for German and French; for English, the Groningen Meaning Bank [14] was adapted for training. Ertim, affiliated with Inalco, Paris, applied their legacy (2010-13) NER sys- tem mXS27 [75] for contemporary texts on the historical French HIPE data with- out any adaptation or training [76]28 . The system uses pattern mining and non- neural machine learning for NERC and their model is based on the QUAERO standard [92], which is the basis for the HIPE annotation guidelines. For EL, only the type Person was considered. The resolution is done in two steps, first an approximate string match retrieves French Wikipedia pages, second the Wiki- data item is selected whose Wikipedia article has the highest cosine similarity with the HIPE newspaper article containing the mention. Inria, by the ALMAnaCH project team affiliated at Inria, Paris, used DeLFT (Deep Learning Framework for Text)29 for NERC tagging of English and French [60]. For English, the pre-trained Ontonotes 5.0 CoNLL-2012 model was used with a BiLSTM-CRF architecture. For EL-only, the off-the-shelf named entity recognition and linking system entity-fishing 30 was run on the HIPE data for predicting links for the literal meaning. For English, it achieved the best per- formance overall, for French, it ranked second best in F1 score. This Wikipedia- based system specifically stands out with its high recall. Irisa, by a team from IRISA, Rennes, France, focused on French NERC and EL [33]. For NERC, they improved the non-neural CRF baseline system with additional features such as context tokens, date regex match, ASCII normaliza- tion of the focus token, and the 100 most similar words from the HIPE fastText word embeddings provided by the organizers. For EL, a knowledge-base driven approach was applied to disambiguate and link the mentions of their NERC 27 https://github.com/eldams/mXS 28 The final paper contains post-submission experiments. 29 https://github.com/kermitt2/delft 30 https://github.com/kermitt2/entity-fishing systems and the gold oracle NERC mentions [32]. Their experiments with the HIPE data revealed that collective entity linking is also beneficial for this type of texts—in contrast to linking mentions separately. L3i, by the L3i laboratory team affiliated with La Rochelle University, France, tackled all prediction tasks of HIPE for all languages and achieved almost ev- erywhere the best results [13]. For NERC, they used a hierarchical transformer- based model [108] built upon BERT [22] in a multi-task learning setting. 
On top of the pre-trained BERT blocks (multilingual BERT for all languages, addi- tionally Europeana BERT for German31 and CamemBERT for French [69]), two task-specific transformer layers were optionally added to alleviate data sparsity issues, for instance out-of-vocabulary words, spelling variations, or OCR errors in the HIPE dataset. A state-of-the-art CRF layer was added on top in order to model the context dependencies between entity tags. For fine-tuning, relatively small batch sizes were used: 4 for German and English, 2 for French. For base BERT with a limited context of 512 sub-tokens, documents are too long and newspaper lines are too short for proper contextualization. Therefore, an im- portant pre-processing step consisted in the reconstruction of hyphenated words and in sentence segmentation with Freeling [80]. The team submitted several runs based on different configurations of their model and resources. For the two languages with in-domain training data (French and German), the results of run 1 on literal NERC-coarse without the two transformer layers were slightly lower (roughly 1 percentage point in F score) than run 2 with transformer layers. For English without in-domain training data, two options for fine-tuning were tested: a) training on monolingual CoNLL 2003 data, and b) transfer learning by train- ing on the French and German HIPE data. Both options worked better without transformer layers, (a) was slightly better on strict boundary evaluation, and (b) on fuzzy boundary evaluation. For their EL approach, based on [58], the team built a Wikipedia/Wikidata knowledge base per language and trained entity em- beddings for the most frequent entries [40]. Based on Wikipedia co-occurrence counts, a probabilistic mapping table was computed for linking mentions with entities—taking several mention variations (e.g. lowercase, Levenshtein distance) into account to improve the matching. The candidates were filtered using DB- pedia and Wikidata by prioritizing those that corresponded to the named entity type. For persons, they analysed the date of birth to discard anachronistic enti- ties. Finally, the five best matching candidates were predicted. Limsi, affiliated with LIMSI, CNRS, Paris, France, focused on NERC-coarse for French and achieved second best results there [41]. They submitted runs from 3 model variations: a) A model based on CamemBERT [69] that jointly predicts the literal and metonymic entities by feeding into two different softmax layers. This model performed best on the dev set for metonymic entities. b) The model (a) with a CRF layer on top, which achieved their best results on literal tags (F1=.814 strict). c) A standard CamemBERT model that predicts concatenated literal and metonymic labels directly as a combined tag (resulting 31 https://github.com/stefan-it/europeana-bert in a larger prediction tagset). This model performed best (within Limsi’s runs) on the test set for metonymic entities (F1=.667 strict). Nlp-uqam, affiliated with Université du Quebec, Montréal, Canada, focused on coarse NERC for French [21]. Their architecture involves a BiLSTM layer for word-level feature extraction with a CRF layer on top for capturing label depen- dencies [63], and an attention layer in between for relating different positions of a sequence [108]. For their rich word representation, they integrate a character- based CNN approach [18] and contextualized character-based flair embeddings [6] as provided by the HIPE organizers. 
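Several of the systems described in this section (e.g. Cisteria and Nlp-uqam above) rely on a flair sequence tagger combining stacked embeddings with a BiLSTM encoder and a CRF output layer. The sketch below shows what such a setup looks like in the flair framework; the data folder, file names, column mapping and embedding choices are placeholder assumptions rather than any team's actual configuration.

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Assumed column layout of a HIPE-style IOB file: token in column 0,
# coarse literal NER tag in column 1; '#' lines hold metadata.
columns = {0: "text", 1: "ner"}
corpus = ColumnCorpus("data/", columns,
                      train_file="train-fr.tsv",   # placeholder file names
                      dev_file="dev-fr.tsv",
                      test_file="test-fr.tsv",
                      comment_symbol="#")

tag_dictionary = corpus.make_tag_dictionary(tag_type="ner")

# Stack classical word embeddings with contextual character-level ones.
embeddings = StackedEmbeddings([
    WordEmbeddings("fr"),            # French fastText vectors
    FlairEmbeddings("fr-forward"),   # contextual string embeddings
    FlairEmbeddings("fr-backward"),
])

# BiLSTM encoder with a CRF layer on top, as in Lample et al.-style taggers.
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type="ner",
                        use_crf=True)

ModelTrainer(tagger, corpus).train("models/nerc-coarse-fr",
                                   learning_rate=0.1,
                                   mini_batch_size=32,
                                   max_epochs=50)
```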
Sbb, affiliated with the Berlin State Library, Berlin, focused on NERC-coarse and EL for all languages [61]. For NERC, they applied a model based on multi- lingual BERT embeddings, which were additionally pre-trained on OCRed his- torical German documents from the SBB collection and subsequently fine-tuned on various multilingual NER data sets [62]. For EL, they constructed a multi- lingual knowledge base from Wikipedia (WP) articles roughly resembling the categories Person, Location, and Organization. The title words of these pages were embedded by BERT and stored in a nearest neighbor lookup index. A lookup applied to a mention returns a set of linked entity candidates. The his- torical text segment containing the mention and sentences from WP containing a candidate are then scored by a BERT sentence comparison model. This model was trained to predict for arbitrary WP sentence pairs whether they talk about the same entity or not. A random forest classifier finally ranks the candidates based on their BERT sentence comparison scores. SinNer, affiliated with INRIA and Paris-Sorbonne University, Paris, France, focused on literal NERC-Coarse for French and German [78] and ranked third on both languages with their best neural run. The team preprocessed the line- based format into sentence-split segments. They provided two runs based on a BiLSTM ELMo architecture [82]. Run 1 is based on the classical ELMo archi- tecture (without a CRF layer), combining type-level CNN word representations with a contextualized two-layer ELMo representation. For run 2, which performs better than their run 1 and is the one reported here, they combined modern Common Crawl-based fastText [43, 10] and pre-trained contextualized ELMo embeddings32 in a modern BiLSTM-CRF architecture [103]. They optimized hyperparameters by training each variant three times and by selecting on F1 score performance on the dev set. For run 3, they retrained SEM33 with the of- ficial HIPE data sets and applied entity propagation on the document level. For German, they augmented SEM’s gazetteers with location lexicons crawled from Wikipedia. The considerably lower performance of run 3 illustrates the advan- tage of embedding-based neural NER tagging. Ablation experiments on sentence splitting showed an improvement of 3.5 F1 percentage points on French data for their neural system of run 1. 32 [79] for French, [70] for German. 33 SEM [24] is a CRF-based tool using Wapiti [64] as its linear-chain CRF implemen- tation. Upb, affiliated with the Politehnica University of Bucharest, Bucarest, Bul- garia, focused on literal NERC-coarse for all languages. Their BERT-based model centers around the ideas of transfer and multi-task learning as well as multilingual word embeddings. Their best performing runs combine multilin- gual BERT embeddings with a BiLSTM layer followed by a dense layer with local SoftMax predictions or alternatively, by adding a CRF layer on top of the BiLSTM. Uva-ilps, affiliated with the University of Amsterdam and Radboud Uni- versity, The Netherlands, worked on NERC-coarse and end-to-end EL for literal senses in all languages, as well as on literal and metonymic EL-only for En- glish [86]. They fine-tuned BERT models for token-level NERC prediction using Huggingface’s transformer framework [110], using the cased multilingual BERT base model for French and German and the cased monolingual BERT base model for English. For training their English model, they used the CoNLL-03 data [106]. 
Their end-to-end EL approach was implemented by searching for each entity mention in the English Wikidata dump indexed by ElasticSearch34 , an approach that outperformed the baseline system. The main problem there was the lack of German and French entities, although person names still could be found. For run 1 and 2 of EL-only on English, they improved the candidate entity ranking by calculating cosine similarities between the contextual embeddings of a sentence containing the target entity mention and a modified sentence where the mention was replaced with a candidate entity description from Wikidata. The semantic similarity scores were multiplied by relative Levenshtein similarity scores between target mention and candidate labels to prefer precise character- level matches. Run 2 added historical spelling variations, however, this resulted in more false positives. Run 3 used REL [51], a completely different neural NERC and EL system. Candidate selection in REL is twofold, 4 candidates are selected by a probabilistic model predicting entities given a mention, and 3 candidates are proposed by a model predicting entities given the context of the mention. Candidate disambiguation combines local compatibility (prior importance, con- textual similarity) and global coherence with other document-level entity linking decisions. Their REL-based run 3 outperformed their runs 1 and 2 clearly. Webis, by the Webis group affiliated with the Bauhaus University Weimar, Germany, focused on NERC-coarse for all languages. For each language, they trained a flair NERC sequence tagger [5] with a CRF layer using a stack of four embeddings: Glove embeddings [81], contextual character-based flair embed- dings, and the forward and backward HIPE character-based flair embeddings. Their pre-processing included sentence reconstruction (by splitting the token se- quence on all periods, except after titles, month abbreviations or numbers), and dehyphenation of tokens at the end of lines. For German, they experimented with data augmentation techniques by duplicating training set sentences and re- placing the contained entities by randomly chosen new entities of the same type retrieved from Wikidata. A post-processing step resolved IOB tag sequence in- consistencies and applied a pattern-based tagging for time expressions. Although 34 https://www.elastic.co/ internal dev set validation F1-scores looked promising, their official results on the test set had a bias towards precision. This could be due to format conversion issues. 7 Results and Discussion We report results for the best run of each team and consider micro Precision, Re- call and F1 scores exclusively. Results for NERC-Coarse and NERC-Fine for the three languages, both evaluation regimes and the literal and metonymic senses are presented in Table 5 and 6 respectively, while results for nested entities and entity components are presented in Table 7. Table 8 reports performances for end-to-end EL and EL only, with a cut-off @1 and Table 9 for EL only with cut-offs @3 and @5. General observations. Neural systems with strong embedding resources clearly prevailed in HIPE NERC, beating symbolic CRF or pattern-matching based ap- proaches by a large margin (e.g., compare baseline performance in Table 5). However, we also notice performance differences between neural systems that rely on BiLSTMs or BERT, the latter generally performing better. 
In general and not unexpectedly, we observe that the amount of available training and development data correlates with system performances. French with the largest amount of training data has better results than German, and English is worse than German (see median numbers in Table 5). The one exception is EL only where English, as a well-resourced language, seems to have the neces- sary tooling to also excel on non-standard, historical text material (cf. Inria results). NERC-Coarse performances show a great diversity but top results are better than expected, specifically for French where they are almost on a par with performances on contemporary texts. Here, six teams have fuzzy F1 scores higher than .8, suggesting good prospects for entity extraction systems on his- torical texts, when trained with appropriate and sufficient data. Fine-grained NERC with more than 12 classes is obviously more difficult than predicting only 5 categories. However, the performance drop of the best performing system L3i is relatively mild for French, 6.5 percentage points on fuzzy F1, and a little stronger for German (10.7). The recognition of entity components shows reasonable performances and suggests that knowledge base population and/or biography reconstruction from historical texts is feasible. The same cannot be said of nested entities. Finally, EL performances are, as expected, lower than for NERC (best F1 score in the range of .58 to .63 for EL only strict across languages), and systems’ performances are as diverse (cf. Table 8). The propagation of NERC mistakes in the end-to-end setting induces lower performances, however the provision of mention boundaries does not drastically improve results (e.g. 4 percentage points for the best system on French), suggesting that being able to deal with OCR noise (provided mentions are not OCR-corrected) and NIL entities is as impor- tant as exact mention recognition. When given the possibility to provide a list of French German English (a) Literal Strict Fuzzy Strict Fuzzy Strict Fuzzy P R F P R F P R F P R F P R F P R F Cisteria - - - - - - .745 .578 .651 .880 .683 .769 - - - - - - Ehrmama .793 .764 .778 .893 .861 .877 .697 .659 .678 .814 .765 .789 .249 .439 .318 .405 .633 .494 Ertim .435 .248 .316 .604 .344 .439 - - - - - - - - - - - - Inria .605 .675 .638 .755 .842 .796 - - - - - - .461 .606 .524 .568 .746 .645 Irisa .705 .634 .668 .828 .744 .784 - - - - - - - - - - - - L3i .831 .849 .840 .912 .931 .921 .790 .805 .797 .870 .886 .878 .623 .641 .632 .794 .817 .806 Limsi .799 .829 .814 .887 .909 .898 - - - - - - - - - - - - Nlp-uqam .705 .634 .668 .828 .744 .784 - - - - - - - - - - - - Sbb .530 .477 .502 .765 .689 .725 .499 .484 .491 .730 .708 .719 .347 .310 .327 .642 .572 .605 SinNer .788 .802 .795 .886 .902 .894 .658 .658 .658 .775 .819 .796 - - - - - - Upb .693 .686 .689 .825 .817 .821 .677 .575 .621 .788 .740 .763 .522 .416 .463 .743 .592 .659 Uva-ilps .656 .719 .686 .794 .869 .830 .499 .556 .526 .689 .768 .726 .443 .508 .473 .635 .728 .678 Webis .731 .228 .347 .876 .273 .416 .695 .337 .454 .833 .405 .545 .476 .067 .117 .873 .122 .215 Baseline .693 .606 .646 .825 .721 .769 .643 .378 .476 .790 .464 .585 .531 .327 .405 .736 .454 .562 Median .705 .680 .677 .828 .829 .808 .686 .576 .636 .801 .752 .766 .461 .439 .463 .642 .633 .645 Strict Fuzzy Strict Fuzzy Strict Fuzzy (b) Meto. 
(b) Metonymic   French                            German                            English
                Strict          Fuzzy             Strict          Fuzzy             Strict          Fuzzy
                P    R    F     P    R    F       P    R    F     P    R    F       P    R    F     P    R    F
Cisteria        -    -    -     -    -    -       .738 .500 .596  .787 .534 .636    -    -    -     -    -    -
Ehrmama         .697 .554 .617  .708 .562 .627    .696 .542 .610  .707 .551 .619    -    -    -     -    -    -
L3i             .734 .839 .783  .734 .839 .783    .571 .712 .634  .626 .780 .694    .667 .080 .143  1.00 .120 .214
Limsi           .647 .688 .667  .655 .696 .675    -    -    -     -    -    -       -    -    -     -    -    -
Nlp-uqam        .423 .420 .422  .468 .464 .466    -    -    -     -    -    -       -    -    -     -    -    -
Baseline        .541 .179 .268  .541 .179 .268    .814 .297 .435  .814 .297 .435    1.00 .040 .077  1.00 .040 .077
Median          .647 .554 .617  .655 .562 .627    -    -    -     -    -    -       -    -    -     -    -    -
Table 5: Results for NERC-Coarse (micro P, R and F-measure). Bold font indicates the highest, and underlined font the second-highest value.

(a) Literal     French                            German
                Strict          Fuzzy             Strict          Fuzzy
                P    R    F     P    R    F       P    R    F     P    R    F
Ehrmama         .696 .724 .710  .776 .807 .791    .650 .592 .620  .754 .687 .719
Ertim           .418 .238 .303  .568 .324 .412    -    -    -     -    -    -
L3i             .772 .797 .784  .843 .869 .856    .628 .712 .668  .734 .813 .771
(b) Metonymic
Ehrmama         .667 .554 .605  .667 .554 .605    .707 .551 .619  .717 .559 .629
L3i             .718 .661 .688  .738 .679 .707    .601 .703 .648  .659 .771 .711
Table 6: Results for NERC-Fine.

(a) Components  French                            German
                Strict          Fuzzy             Strict          Fuzzy
                P    R    F     P    R    F       P    R    F     P    R    F
Ehrmama         .695 .632 .657  .801 .707 .751    .681 .494 .573  .735 .534 .618
Ertim           .042 .045 .043  .074 .080 .077    -    -    -     -    -    -
L3i             .680 .732 .657  .773 .832 .801    .595 .698 .642  .654 .768 .707
(b) Nested
Ehrmama         .397 .280 .329  .448 .317 .371    -    -    -     -    -    -
L3i             .337 .402 .367  .357 .427 .389    .471 .562 .513  .517 .616 .562
Table 7: Results for nested entities and entity components.

Fig. 4: F1 score as a function of time for the 5 best systems for NERC (top) and end-to-end EL (bottom) for the languages French (left) and German (right). The x-axis shows 20-year time buckets (e.g. 1790 = 1790-1809).

System-based observations. With L3i, the HIPE 2020 campaign has a clear overall winner on NERC coarse and fine, literal and metonymic entities, components, as well as EL. The one exception is EL-only for English, where Inria's entity-fishing system outperforms L3i. L3i is particularly convincing in terms of F1, as it consistently keeps precision and recall in good balance (often even trending towards recall). Other systems, e.g. Inria, Ehrmama, or the baseline, typically suffer from a bias towards precision. It seems that actively tackling the problems of OCR noise, word hyphenation and sentence segmentation helps to achieve better recall.
Time-based observations. In order to gauge the impact of the article's publication date on system performances, we analyze the variation of F1 scores as a function of time (see Fig. 4); a sketch of such a per-period breakdown is given after Fig. 5 below. The initial hypothesis here was that the older the article, the more difficult it is to extract and link the mentions it contains. In general, there does not seem to be a strong correlation between the article's publication date and F1 scores. In the specific case of EL, this finding is in line with the uniform distribution of NIL entities across time (see Section 4).

Fig. 5: Impact of OCR noise: distribution of performances across systems on entities with different noise level severity for NERC (panel a: NERC-Coarse) and end-to-end EL (panel b: relaxed evaluation regime, cutoff @3).
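The per-period breakdown behind Fig. 4 can be reproduced on any system output. The following hypothetical sketch (the data layout, function names and the strict-matching shortcut are assumptions, not the organizers' analysis code) groups documents into 20-year buckets by publication year and computes a micro F1 per bucket.

```python
# Hypothetical sketch: micro F1 per 20-year publication bucket, given per-document
# gold and predicted entity spans (strict matching only, for brevity).
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int, str]                # (start, end, coarse type)
Doc = Tuple[int, List[Span], List[Span]]   # (publication year, gold spans, predicted spans)

def bucket(year: int, width: int = 20, origin: int = 1790) -> int:
    # e.g. 1805 -> 1790, 1923 -> 1910
    return origin + ((year - origin) // width) * width

def f1_by_period(docs: Iterable[Doc]) -> Dict[int, float]:
    counts = defaultdict(lambda: [0, 0, 0])  # bucket -> [tp, fp, fn]
    for year, gold, pred in docs:
        tp = len(set(gold) & set(pred))      # exact span-and-type matches
        c = counts[bucket(year)]
        c[0] += tp
        c[1] += len(pred) - tp
        c[2] += len(gold) - tp
    result = {}
    for b, (tp, fp, fn) in sorted(counts.items()):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        result[b] = 2 * p * r / (p + r) if p + r else 0.0
    return result
```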
End-to-end EL
(a) Literal     French                            German                            English
                Strict          Fuzzy             Strict          Fuzzy             Strict          Fuzzy
                P    R    F     P    R    F       P    R    F     P    R    F       P    R    F     P    R    F
Ertim           .150 .084 .108  .150 .084 .108    -    -    -     -    -    -       -    -    -     -    -    -
Irisa           .446 .399 .421  .465 .417 .439    -    -    -     -    -    -       -    -    -     -    -    -
L3i             .594 .602 .598  .613 .622 .617    .531 .538 .534  .553 .561 .557    .523 .539 .531  .523 .539 .531
Sbb             .594 .310 .407  .616 .321 .422    .540 .304 .389  .561 .315 .403    .257 .097 .141  .257 .097 .141
Uva-ilps        .352 .195 .251  .353 .196 .252    .245 .272 .258  .255 .283 .268    .249 .375 .300  .249 .375 .300
Baseline        .206 .342 .257  .257 .358 .270    .173 .187 .180  .188 .203 .195    .220 .263 .239  .220 .263 .239
(b) Metonymic
Irisa           .023 .295 .043  .041 .527 .076    -    -    -     -    -    -       -    -    -     -    -    -
L3i             .236 .402 .297  .366 .625 .462    .324 .508 .396  .384 .602 .469    .172 .200 .185  .172 .200 .185
Baseline        .002 .027 .004  .008 .098 .015    .025 .136 .042  .026 .144 .044    .004 .040 .007  .004 .040 .007

EL only
(a) Literal     French                            German                            English
                Strict          Fuzzy             Strict          Fuzzy             Strict          Fuzzy
                P    R    F     P    R    F       P    R    F     P    R    F       P    R    F     P    R    F
Inria           .585 .650 .616  .604 .670 .635    -    -    -     -    -    -       .633 .685 .658  .633 .685 .658
Irisa           .475 .473 .474  .492 .491 .492    -    -    -     -    -    -       -    -    -     -    -    -
L3i             .640 .638 .639  .660 .657 .659    .581 .582 .582  .601 .602 .602    .593 .593 .593  .593 .593 .593
Sbb             .677 .371 .480  .699 .383 .495    .615 .349 .445  .636 .361 .461    .344 .119 .177  .344 .119 .177
Uva-ilps        -    -    -     -    -    -       -    -    -     -    -    -       .607 .580 .593  .607 .580 .593
Baseline        .502 .495 .498  .516 .508 .512    .420 .416 .418  .440 .435 .437    .506 .506 .506  .506 .506 .506
(b) Metonymic
Irisa           .025 .357 .047  .041 .580 .076    -    -    -     -    -    -       -    -    -     -    -    -
L3i             .303 .446 .361  .461 .679 .549    .443 .627 .519  .515 .729 .604    .286 .480 .358  .286 .480 .358
Uva-ilps        -    -    -     -    -    -       -    -    -     -    -    -       .031 .058 .031  .031 .058 .031
Baseline        .213 .312 .254  .323 .473 .384    .265 .373 .310  .331 .466 .387    .219 .280 .246  .219 .280 .246
Table 8: Results for end-to-end EL (top) and EL only (bottom) with P, R and F1 @1.

                French               German               English
@3              P    R    F          P    R    F          P    R    F
Irisa           .530 .463 .494       -    -    -          -    -    -
L3i             .676 .686 .681       .621 .630 .626       .627 .649 .638
Sbb             .624 .325 .428       .590 .332 .425       .299 .112 .163
Uva-ilps        .393 .218 .281       .300 .332 .315       .285 .429 .343
@5
Irisa           .554 .497 .524       -    -    -          -    -    -
L3i             .695 .705 .700       .627 .636 .632       .651 .674 .662
Sbb             .629 .328 .431       .601 .338 .432       .299 .112 .163
Uva-ilps        .397 .220 .283       .311 .345 .327       .304 .458 .366
Table 9: Results for EL-only with fuzzy P, R and F1 @3 and @5.

Impact of OCR noise. To assess the impact of noisy entities on NERC and EL, we evaluated systems' performances at various noise levels (see Fig. 5). The level of noise is defined as the length-normalized Levenshtein distance between the surface form of an entity and its manual transcription (a minimal sketch of this measure is given below). There is a remarkable difference between the performances on noisy and non-noisy mentions for both NERC and EL. As little noise as 0.1 already severely hurts systems' ability to predict an entity and may cut their performance by half. Interestingly, EL also suffers badly from little noise (norm. lev. dist. > 0.0 and < 0.1), even when provided with gold NERC annotations (EL-only, not shown in the plot). Slightly and moderately noisy mentions (norm. lev. dist. > 0.0 and < 0.3) show a similar impact, while for highly noisy mentions the performance deteriorates further. We observe the greatest variation between systems at the medium noise level, suggesting that the most robust systems gain their competitive advantage when dealing with moderately noisy mentions.
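To make the noise measure concrete, here is a minimal sketch of a length-normalized Levenshtein distance between an OCRed surface form and its manual transcription, together with bins mirroring the ranges quoted above. It is an illustration under stated assumptions (in particular, normalization by the length of the manual transcription), not the organizers' code.

```python
# Illustrative sketch: length-normalized Levenshtein distance between an OCRed
# entity surface form and its manual transcription, plus coarse noise bins.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def noise_level(ocr_form: str, transcription: str) -> float:
    # Normalization by the length of the manual transcription is assumed here.
    return levenshtein(ocr_form, transcription) / max(len(transcription), 1)

def noise_bin(d: float) -> str:
    if d == 0.0:
        return "no noise"
    if d < 0.1:
        return "little noise"
    if d < 0.3:
        return "medium noise"
    return "high noise"

print(noise_level("Pregburg", "Pressburg"))  # one deletion + one substitution -> ~0.22
```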
On the effect of OCR noise on NERC, [104] claim that OCR errors affect geo-political (GPE) mentions more than persons or dates; in our breakdown of OCR noise impact by type, we can confirm that claim for little noise only (norm. lev. dist. > 0.0 and < 0.1), whereas the trend reverses for highly noisy entities.
8 Conclusion and Perspectives
From the perspective of natural language processing, the HIPE evaluation lab provided the opportunity to test the robustness of NERC and EL approaches against challenging historical material and to gain new insights with respect to domain and language adaptation. With regard to NERC, results show that it is possible to design systems capable of dealing with historical and noisy inputs, whose performances compete with those obtained on contemporary texts. Entity linking, as well as the processing of metonymy and nested entities, remains a challenging aspect of historical NE processing (the latter two probably due to the limited amount of annotated material). The results across the three languages present in the HIPE 2020 campaign suggest that performances mainly depend on the amount of available in-domain training material. The evaluation of the influence of OCR noise on performance confirmed the expected degradation of NERC and EL quality as OCR errors increase. More surprising is the fact that neither NERC nor EL performance seems to correlate with the date of publication.
From the perspective of digital humanities, the lab's outcomes will help DH practitioners in mapping state-of-the-art solutions for NE processing on historical texts, and in getting a better understanding of what is already possible as opposed to what is still challenging. Most importantly, digital scholars need support to explore the large quantities of digitized text they currently have at hand, and NE processing is high on the agenda. Such processing can support research questions in various domains (e.g. history, political science, literature, historical linguistics), and knowing about its performance is crucial in order to make informed use of the processed data. Overall, HIPE has contributed to advancing the state of the art in semantic indexing of historical newspapers and, more generally, of historical material.
As future work, we intend to explore several directions for a potential second edition of HIPE: expanding the language spectrum, strengthening the already covered languages by providing more training data, considering other types of historical documents, and exploring to what extent the improvements shown in HIPE can be transferred to similar tasks in other domains, or to linking problems that require knowledge bases other than Wikidata.
Acknowledgements
This HIPE evaluation lab would not have been possible without the interest and commitment of many.
We express our warmest thanks to: the Swiss newspapers NZZ and Le Temps, and the Swiss and Luxembourg national libraries for sharing part of their data in the frame of the impresso project; Camille Watter, Gerold Schneider, Emmanuel Decker and Ilaria Comes for their commitment and hard work on the construction of the data set; the INCEpTION project team for its valuable and efficient support with the annotation tool; Richard Eckart de Castilho, Clemens Neudecker, Sophie Rosset and David Smith for their encouragement and guidance as part of the HIPE advisory board; and, finally, the 13 teams who embarked on this first HIPE edition, for their patience and scientific involvement. HIPE is part of the research activities of the project “impresso – Media Monitoring of the Past”, for which we also gratefully acknowledge the financial support of the Swiss National Science Foundation under grant number CR-SII5 173719.
Bibliography
[1] ACE05: The ACE 2005 (ACE05) Evaluation Plan. Tech. rep., NIST ACE (10 2005), http://www.itl.nist.gov/iad/mig/tests/ace/2005/doc/ace05-evalplan.v3.pdf
[2] ACE08: ACE08 Evaluation Plan v1.2d. Tech. rep., NIST ACE (04 2008), http://www.itl.nist.gov/iad/mig//tests/ace/2008/doc/ace08-evalplan.v1.2d.pdf
[3] Aguilar, S.T., Tannier, X., Chastang, P.: Named entity recognition applied on a data base of medieval Latin charters. The case of chartae burgundiae. In: 3rd International Workshop on Computational History (HistoInformatics 2016) (2016)
[4] Ahmed, S., Stoeckel, M., Driller, C., Pachzelt, A., Mehler, A.: BIOfid Dataset: Publishing a German Gold Standard for Named Entity Recognition in Historical Biodiversity Literature. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). pp. 871–880. Association for Computational Linguistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/K19-1081, https://www.aclweb.org/anthology/K19-1081
[5] Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: An easy-to-use framework for state-of-the-art NLP. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). pp. 54–59. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019), https://www.aclweb.org/anthology/N19-4010
[6] Akbik, A., Blythe, D., Vollgraf, R.: Contextual String Embeddings for Sequence Labeling. In: Proceedings of the 27th International Conference on Computational Linguistics. pp. 1638–1649. Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug 2018), http://www.aclweb.org/anthology/C18-1139
[7] Artiles, J., Gonzalo, J., Sekine, S.: WePS 2 evaluation campaign: overview of the web people search clustering task. In: Proceedings of the 2nd Web People Search Evaluation Workshop (WePS 2009), collocated with the WWW conference (2009)
[8] Baldwin, T., de Marneffe, M.C., Han, B., Kim, Y.B., Ritter, A., Xu, W.: Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entity recognition. In: Proceedings of the Workshop on Noisy User-generated Text. pp. 126–135. Association for Computational Linguistics, Beijing, China (Jul 2015). https://doi.org/10.18653/v1/W15-4319, https://www.
aclweb.org/anthology/W15-4319 [9] Benikova, D., Biemann, C., Kisselew, M., Pado, S.: Germeval 2014 named entity recognition shared task: Companion paper (2014), http: //nbn-resolving.de/urn:nbn:de:gbv:hil2-opus-3006 [10] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017), https://www.aclweb.org/ anthology/Q17-1010 [11] Bollmann, M.: A Large-Scale Comparison of Historical Text Normaliza- tion Systems. In: Proceedings of the 2019 Conference of the North Amer- ican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 3885– 3898. Association for Computational Linguistics, Minneapolis, Minnesota (2019). https://doi.org/10.18653/v1/N19-1389 [12] Borin, L., Kokkinakis, D., Olsson, L.J.: Naming the past: Named entity and animacy recognition in 19th century Swedish literature. In: Proceed- ings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH 2007). pp. 1–8 (2007) [13] Boros, E., Linhares Pontes, E., Cabrera-Diego, L.A., Hamdi, A., Moreno, J.G., Sidère, N., Doucet, A.: Robust Named Entity Recognition and Link- ing on Historical Multilingual Documents. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [14] Bos, J., Basile, V., Evang, K., Venhuizen, N.J., Bjerva, J.: The Gronin- gen meaning bank. In: Handbook of Linguistic Annotation, pp. 463–496. Springer Netherlands (2017) [15] Bunescu, R., Paşca, M.: Using encyclopedic knowledge for named entity disambiguation. In: 11th Conference of the European Chapter of the Asso- ciation for Computational Linguistics. Association for Computational Lin- guistics, Trento, Italy (Apr 2006), https://www.aclweb.org/anthology/ E06-1002 [16] Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.): CLEF 2020 Work- ing Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings (2020) [17] Chiron, G., Doucet, A., Coustaty, M., Visani, M., Moreux, J.P.: Impact of OCR Errors on the Use of Digital Libraries: Towards a Better Access to Information. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries. pp. 249–252. JCDL ’17, IEEE Press, Piscataway, NJ, USA (2017), http://dl.acm.org/citation.cfm?id=3200334.3200364 [18] Chiu, J.P., Nichols, E.: Named entity recognition with bidirectional LSTM- CNNs. Transactions of the Association for Computational Linguistics 4, 357–370 (2016). https://doi.org/10.1162/tacl a 00104 [19] Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug), 2493–2537 (2011) [20] Cucerzan, S.: Large-scale named entity disambiguation based on Wikipedia data. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). pp. 708–716. Association for Computational Linguistics, Prague, Czech Republic (Jun 2007), https://www.aclweb. org/anthology/D07-1074 [21] Dekhili, G., Sadat, F.: Hybrid Statistical and Attentive Deep Neural Ap- proach for Named Entity Recognition in Historical Newspapers. In: Cap- pellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. 
Working Notes of CLEF 2020 - Conference and Labs of the Evalu- ation Forum. CEUR-WS (2020) [22] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805 [23] Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., Weischedel, R.: The automatic content extraction (ACE) program, tasks, data, and evaluation. In: Proceedings of the Fourth International Con- ference on Language Resources and Evaluation (LREC-2004). European Language Resources Association (ELRA), Lisbon, Portugal (May 2004), http://www.lrec-conf.org/proceedings/lrec2004/pdf/5.pdf [24] Dupont, Y., Dinarelli, M., Tellier, I., Lautier, C.: Structured Named En- tity Recognition by Cascading CRFs. In: Intelligent Text Processing and Computational Linguistics (CICling) (2017) [25] Ehrmann, M., Colavizza, G., Rochat, Y., Kaplan, F.: Diachronic Evalua- tion of NER Systems on Old Newspapers. In: Proceedings of the 13th Con- ference on Natural Language Processing (KONVENS 2016)). pp. 97–107. Bochumer Linguistische Arbeitsberichte (2016), https://infoscience. epfl.ch/record/221391?ln=en [26] Ehrmann, M., Nouvel, D., Rosset, S.: Named Entity Resources - Overview and Outlook. In: Calzolari, N., Choukri, K., Declerck, T., Goggi, S., Gro- belnik, M., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. (eds.) Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA), Paris, France (may 2016) [27] Ehrmann, M., Romanello, M., Bircher, S., Clematide, S.: Introducing the CLEF 2020 HIPE Shared Task: Named Entity Recognition and Linking on Historical Newspapers. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Re- trieval. pp. 524–532. Lecture Notes in Computer Science, Springer Interna- tional Publishing, Cham (2020). https://doi.org/10.1007/978-3-030-45442- 5 68 [28] Ehrmann, M., Romanello, M., Clematide, S., Ströbel, P.B., Barman, R.: Language Resources for Historical Newspapers: the Impresso Collection. In: Proceedings of The 12th Language Resources and Evaluation Con- ference. pp. 958–968. European Language Resources Association, Mar- seille, France (May 2020), https://www.aclweb.org/anthology/2020. lrec-1.121 [29] Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: HIPE - Shared Task Participation Guidelines (v1.1) (2020). https://doi.org/10.5281/zenodo.3677171 [30] Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Impresso Named Entity Annotation Guidelines (Jan 2020). https://doi.org/10.5281/zenodo.3604227 [31] Ehrmann, M., Romanello, M., Flückiger, A., Clematide, S.: Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and In- teraction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020). Lecture Notes in Computer Science (LNCS), vol. 12260. Springer (2020) [32] El Vaigh, C.B., Goasdoué, F., Gravier, G., Sébillot, P.: Using Knowl- edge Base Semantics in Context-Aware Entity Linking. In: Proceedings of the ACM Symposium on Document Engineering 2019. pp. 1–10. 
Do- cEng ’19, Association for Computing Machinery, Berlin, Germany (Sep 2019). https://doi.org/10.1145/3342558.3345393 [33] El Vaigh, C.B., Le Noé-Bienvenu, G., Gravier, G., Sébillot, P.: IRISA Sys- tem for Entity Detection and Linking at HIPE’20. In: Cappellato, L., Eick- hoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [34] Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sam- pling. In: Proceedings of the 43rd Annual Meeting of the Asso- ciation for Computational Linguistics (ACL’05). pp. 363–370. As- sociation for Computational Linguistics, Ann Arbor, Michigan (Jun 2005). https://doi.org/10.3115/1219840.1219885, https://www.aclweb. org/anthology/P05-1045 [35] Frontini, F., Brando, C., Ganascia, J.G.: Semantic Web Based Named En- tity Linking for Digital Humanities and Heritage Texts. In: Zucker, A., Draelants, I., Zucker, C.F., Monnin, A. (eds.) First International Work- shop Semantic Web for Scientific Heritage at the 12th ESWC 2015 Confer- ence. Portorož, Slovenia (2015), https://hal.archives-ouvertes.fr/ hal-01203358 [36] Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P., Quintard, L.: Ex- tended Named Entity Annotation on OCRed Documents : From Corpus Constitution to Evaluation Campaign. In: Proceedings of the Eighth con- ference on International Language Resources and Evaluation. pp. 3126– 3131. Istanbul, Turkey (2012) [37] Galibert, O., Leixa, J., Adda, G., Choukri, K., Gravier, G.: The ETAPE speech processing evaluation. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). pp. 3995– 3999. European Language Resources Association (ELRA), Reykjavik, Ice- land (May 2014), http://www.lrec-conf.org/proceedings/lrec2014/ pdf/1027_Paper.pdf [38] Galibert, O., Rosset, S., Grouin, C., Zweigenbaum, P., Quintard, L.: Struc- tured and extended named entity evaluation in automatic speech transcrip- tions. In: IJCNLP. pp. 518–526 (2011) [39] Galliano, S., Geoffrois, E., Mostefa, M., Choukri, K., Bonastre, J.f., Gravier, G.: The ESTER phase II evaluation campaign for the rich tran- scription of French broadcast news. In: Proceedings of the 9th Euro- pean Conference on Speech Communication and Technology (INTER- SPEECH’05. pp. 1149–1152 (2005) [40] Ganea, O.E., Hofmann, T.: Deep Joint Entity Disambiguation with Local Neural Attention. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2619–2629 (2017) [41] Ghannay, S., Grouin, C., Lavergne, T.: Experiments from LIMSI at the French Named Entity Recognition Coarse-grained task. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [42] Goulart, R.R.V., Strube de Lima, V.L., Xavier, C.C.: A system- atic review of named entity recognition in biomedical texts. Jour- nal of the Brazilian Computer Society 17(2), 103–116 (Jun 2011). https://doi.org/10.1007/s13173-011-0031-9, https://doi.org/10.1007/ s13173-011-0031-9 [43] Grave, E., Bojanowski, P., Gupta, P., Joulin, A., Mikolov, T.: Learning word vectors for 157 languages. In: Proceedings of the Eleventh Interna- tional Conference on Language Resources and Evaluation (LREC 2018). 
European Language Resources Association (ELRA), Miyazaki, Japan (May 2018), https://www.aclweb.org/anthology/L18-1550 [44] Grishman, R., Sundheim, B.: Design of the MUC-6 evaluation. In: Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland (1995) [45] Grishman, R., Sundheim, B.: Message Understanding Conference - 6: A Brief History. In: Proceedings of the 16th International Conference on Computational Linguistics - Volume 1. pp. 466–471. COLING’96, As- sociation for Computational Linguistics, Stroudsburg, PA, USA (1996). https://doi.org/10.3115/992628.992709, event-place: Copenhagen, Den- mark [46] Grover, C., Givon, S., Tobin, R., Ball, J.: Named Entity Recognition for Digitised Historical Texts. In: Proceedings of the Sixth Interna- tional Conference on Language Resources and Evaluation (LREC’08). Eu- ropean Language Resources Association (ELRA), Marrakech, Morocco (May 2008), http://www.lrec-conf.org/proceedings/lrec2008/pdf/ 342_paper.pdf [47] Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text. In: EMNLP (2011) [48] Hooland, S.V., De Wilde, M., Verborgh, R., Steiner, T., Van de Walle, R.: Exploring entity recognition and disambiguation for cultural heritage collections. Digital Scholarship in the Humanities 30(2), 262–279 (2015). https://doi.org/10.1093/llc/fqt067 [49] Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991 (2015), http://arxiv.org/abs/1508. 01991 [50] Hubková, H.: Named-entity recognition in Czech historical texts: Using a CNN-BiLSTM neural network model. Ph.D. thesis (2019) [51] van Hulst, J.M., Hasibi, F., Dercksen, K., Balog, K., de Vries, A.P.: REL: An entity linker standing on the shoulders of giants. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. SIGIR ’20, ACM (2020) [52] Jones, A., Crane, G.: The challenge of virginia banks: an evaluation of named entity analysis in a 19th-century newspaper collection. In: Pro- ceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL’06). pp. 31–40. IEEE (2006) [53] Kaplan, F., di Lenardo, I.: Big Data of the Past. Frontiers in Digital Hu- manities 4 (2017). https://doi.org/10.3389/fdigh.2017.00012 [54] Kettunen, K., Ruokolainen, T.: Names, Right or Wrong: Named En- tities in an OCRed Historical Finnish Newspaper Collection. In: Pro- ceedings of the 2nd International Conference on Digital Access to Tex- tual Cultural Heritage - DATeCH2017. pp. 181–186. ACM Press (2017). https://doi.org/10.1145/3078081.3078084 [55] Kim, J.D., Ohta, T., Tateisi, Y., Tsujii, J.: Genia corpus—a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl 1), i180–i182 (2003) [56] Kim, S.M., Cassidy, S.: Finding Names in Trove: Named Entity Recog- nition for Australian Historical Newspapers. In: Proceedings of the Aus- tralasian Language Technology Association Workshop 2015. pp. 57–65. Parramatta, Australia (Dec 2015) [57] Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The inception platform: Machine-assisted and knowledge-oriented interactive annotation. In: Proceedings of the 27th International Conference on Com- putational Linguistics: System Demonstrations. pp. 5–9 (2018) [58] Kolitsas, N., Ganea, O.E., Hofmann, T.: End-to-End Neural Entity Link- ing. 
In: Proceedings of the 22nd Conference on Computational Natural Language Learning. pp. 519–529. Association for Computational Linguis- tics, Brussels, Belgium (Oct 2018). https://doi.org/10.18653/v1/K18-1050 [59] Krippendorff, K.: Content analysis: An introduction to its methodology. Sage publications (1980) [60] Kristanti, T., Romary, L.: DeLFT and entity-fishing: Tools for CLEF HIPE 2020 Shared Task. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Confer- ence and Labs of the Evaluation Forum. CEUR-WS (2020) [61] Labusch, K., Neudecker, C.: Named Entity Disambiguation and Linking Historic Newspaper OCR with BERT. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [62] Labusch, K., Neudecker, C., Zellhöfer, D.: BERT for Named Entity Recog- nition in Contemporary and Historic German. In: Preliminary proceed- ings of the 15th Conference on Natural Language Processing (KONVENS 2019): Long Papers. pp. 1–9. German Society for Computational Linguis- tics & Language Technology, Erlangen, Germany (2019) [63] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural Architectures for Named Entity Recognition. arXiv:1603.01360 [cs] (Mar 2016), http://arxiv.org/abs/1603.01360 [64] Lavergne, T., Cappé, O., Yvon, F.: Practical very large scale CRFs. In: Proceedings of the 48th Annual Meeting of the Association for Computa- tional Linguistics. pp. 504–513. Association for Computational Linguistics (2010) [65] Linhares Pontes, E., Hamdi, A., Sidere, N., Doucet, A.: Impact of OCR Quality on Named Entity Linking. In: Jatowt, A., Maeda, A., Syn, S.Y. (eds.) Digital Libraries at the Crossroads of Digital Information for the Future. pp. 102–115. Lecture Notes in Computer Science, Springer In- ternational Publishing, Cham (2019). https://doi.org/10.1007/978-3-030- 34058-2 11 [66] Magnini, B., Cappelli, A., Tamburini, F., Bosco, C., et al., A.M.: Eval- uation of natural language tools for italian: Evalita 2007. In: Proc. of the 6th International Conference on Language Resources and Evaluation (LREC’08). Marrakech, Morocco (2008) [67] Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R.: Performance mea- sures for information extraction. In: In Proceedings of DARPA Broadcast News Workshop. pp. 249–252 (1999) [68] Markert, K., Nissim, M.: SemEval-2007 task 08: Metonymy resolution at SemEval-2007. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). pp. 36–41. Association for Compu- tational Linguistics, Prague, Czech Republic (Jun 2007), https://www. aclweb.org/anthology/S07-1007 [69] Martin, L., Muller, B., Ortiz Suárez, P.J., Dupont, Y., Romary, L., de la Clergerie, É., Seddah, D., Sagot, B.: CamemBERT: a tasty French lan- guage model. In: Proceedings of the 58th Annual Meeting of the As- sociation for Computational Linguistics. pp. 7203–7219. Association for Computational Linguistics, Online (Jul 2020), https://www.aclweb.org/ anthology/2020.acl-main.645 [70] May, P.: German ELMo Model (2019), https://github.com/ t-systems-on-site-services-gmbh/german-elmo-model [71] Nadeau, D., Sekine, S.: A survey of named entity recognition and classifi- cation. Lingvisticae Investigationes 30(1), 3–26 (2007) [72] Neudecker, C., Antonacopoulos, A.: Making Europe’s Historical News- papers Searchable. In: 2016 12th IAPR Workshop on Document Anal- ysis Systems (DAS). pp. 
405–410. IEEE, Santorini, Greece (Apr 2016). https://doi.org/10.1109/DAS.2016.83 [73] Neudecker, C., Wilms, L., Faber, W.J., van Veen, T.: Large-scale refine- ment of digital historic newspapers with named entity recognition. In: Proc IFLA Newspapers/GENLOC Pre-Conference Satellite Meeting (2014) [74] Nguyen, D.B., Hoffart, J., Theobald, M., Weikum, G.: Aida-light: High- throughput named-entity disambiguation. In: LDOW (2014) [75] Nouvel, D., Antoine, J.Y., Friburger, N.: Pattern Mining for Named Entity Recognition. LNCS/LNAI Series 8387i (post-proceedings LTC 2011) (2014) [76] Nouvel, D., Zagabe Seruti, J.C.: Adapting a pre-neural ML NER System to Historical Data. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Confer- ence and Labs of the Evaluation Forum. CEUR-WS (2020) [77] Okazaki, N.: CRFsuite: a fast implementation of Conditional Random Fields (CRFs) (2007), http://www.chokkan.org/software/crfsuite/ [78] Ortiz Suárez, P.J., Dupont, Y., Lejeune, G., Tian, T.: SinNer@CLEF- HIPE2020: Sinful adaptation of SotA models for Named Entity Recogni- tion in historical French and German newspapers. In: Cappellato, L., Eick- hoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [79] Ortiz Suárez, P.J., Dupont, Y., Muller, B., Romary, L., Sagot, B.: Estab- lishing a new state-of-the-art for French named entity recognition. In: Pro- ceedings of The 12th Language Resources and Evaluation Conference. pp. 4631–4638. European Language Resources Association, Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.lrec-1.569 [80] Padró, L., Stanilovsky, E.: FreeLing 3.0: Towards wider multilinguality. In: Proceedings of the Eighth International Conference on Language Re- sources and Evaluation (LREC’12). pp. 2473–2479. European Language Resources Association (ELRA), Istanbul, Turkey (May 2012), http:// www.lrec-conf.org/proceedings/lrec2012/pdf/430_Paper.pdf [81] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: EMNLP. vol. 14, pp. 1532–43 (2014) [82] Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep Contextualized Word Representations. In: Proceedings of the 2018 Conference of the North American Chap- ter of the Association for Computational Linguistics: Human Lan- guage Technologies, Volume 1 (Long Papers). pp. 2227–2237. Associ- ation for Computational Linguistics, New Orleans, Louisiana (2018). https://doi.org/10.18653/v1/N18-1202 [83] Piotrowski, M.: Natural language processing for historical texts. Synthesis Lectures on Human Language Technologies 5(2), 1–157 (2012) [84] Piskorski, J., Ehrmann, M.: On Named Entity Recognition in Targeted Twitter Streams in Polish. In: Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing, co-located with ACL (BSNLP 2013). pp. 84–93. Sofia, Bulgaria (2013) [85] Plank, B.: What to do about non-standard (or non-canonical) language in NLP. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016)). Bochumer Linguistische Arbeitsberichte (2016) [86] Provatorova, V., Vakulenko, S., Kanoulas, E., Dercksen, K., van Hulst, J.M.: Named Entity Recognition and Linking on Historical Newspapers: UvA.ILPS & REL at CLEF HIPE 2020. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. 
Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [87] Rao, D., McNamee, P., Dredze, M.: Entity linking: Finding extracted en- tities in a knowledge base. In: Poibeau, T., Saggion, H., Piskorski, J., Yangarber, R. (eds.) Multi-source, multilingual information extraction and summarization, pp. 93–115. Springer Berlin Heidelberg, Berlin, Hei- delberg (2013). https://doi.org/10.1007/978-3-642-28569-1, https://doi. org/10.1007/978-3-642-28569-1 [88] Riedl, M., Padó, S.: A named entity recognition shootout for german. In: Proceedings of the 56th Annual Meeting of the Association for Computa- tional Linguistics (Volume 2: Short Papers). pp. 120–125 (2018) [89] Ritter, A., Clark, S., Etzioni, O., et al.: Named entity recognition in tweets: an experimental study. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 1524–1534 (2011) [90] Rodriquez, K.J., Bryant, M., Blanke, T., Luszczynska, M.: Comparison of named entity recognition tools for raw OCR text. In: Jancsary, J. (ed.) 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, Vienna, Austria, September 19-21, 2012. Scientific series of the ÖGAI, vol. 5, pp. 410–414. ÖGAI, Wien, Österreich (2012), http://www.oegai.at/konvens2012/ proceedings/60_rodriquez12w/ [91] Rosset, S., Grouin, C., Fort, K., Galibert, O., Kahn, J., Zweigenbaum, P.: Structured named entities in two distinct press corpora: Contemporary broadcast news and old newspapers. In: Proceedings of the 6th Linguistic Annotation Workshop. pp. 40–48. Association for Computational Linguis- tics (2012) [92] Rosset, Sophie, Grouin, Cyril, Zweigenbaum, Pierre: Entités nommées structurées : guide d’annotation Quaero. NOTES et DOCUMENTS 2011- 04, LIMSI-CNRS (2011) [93] Rovera, M., Nanni, F., Ponzetto, S.P., Goy, A.: Domain-specific named entity disambiguation in historical memoirs. In: CEUR Workshop Pro- ceedings. vol. 2006. RWTH (2017) [94] Santos, D., Seco, N., Cardoso, N., Vilela, R.: HAREM: An Advanced NER Evaluation Contest for Portuguese. In: Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’06). pp. 1640– 1643. Genoa (2006) [95] Schweter, S., Baiter, J.: Towards robust named entity recognition for his- toric german. arXiv preprint arXiv:1906.07592 (2019) [96] Schweter, S., März, L.: Triple E - Effective ensembling of embeddings and language models for NER of historical German. In: Cappellato, L., Eick- hoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [97] Sekine, S., Sudo, K., Nobata, C.: Extended Named Entity Hierarchy. In: Proceedings of The Third International Conference on Language Resources and Evaluation (LREC). Iles Canaries , Espagne (2002) [98] Sevgili, O., Shelmanov, A., Arkhipov, M., Panchenko, A., Biemann, C.: Neural Entity Linking: A Survey of Models based on Deep Learning. arXiv:2006.00575 [cs] (May 2020), http://arxiv.org/abs/2006.00575, arXiv: 2006.00575 [99] Shen, W., Wang, J., Han, J.: Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions. IEEE Transactions on Knowledge and Data Engineering 27(2), 443–460 (Feb 2015). https://doi.org/10.1109/TKDE.2014.2327028, http://ieeexplore. ieee.org/document/6823700/ [100] Smith, D.A., Cordell, R.: A Research Agenda for Historical and Multi- lingual Optical Character Recognition. Tech. rep. (2018), http://hdl. 
handle.net/2047/D20297452 [101] Sporleder, C.: Natural Language Processing for Cultural Heritage Domains. Language and Linguistics Compass 4(9), 750–768 (2010). https://doi.org/10.1111/j.1749-818X.2010.00230.x [102] Sprugnoli, R.: Arretium or Arezzo? A Neural Approach to the Iden- tification of Place Names in Historical Texts. In: Cabrio, E., Mazzei, A., Tamburini, F. (eds.) Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), Torino, Italy, December 10- 12, 2018. CEUR Workshop Proceedings, vol. 2253. CEUR-WS.org (2018), http://ceur-ws.org/Vol-2253/paper26.pdf [103] Straková, J., Straka, M., Hajic, J.: Neural architectures for nested NER through linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5326–5331. Associa- tion for Computational Linguistics, Florence, Italy (Jul 2019), https: //www.aclweb.org/anthology/P19-1527 [104] van Strien, D., Beelen, K., Ardanuy, M., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the Impact of OCR Quality on Down- stream NLP Tasks. In: Proceedings of the 12th International Confer- ence on Agents and Artificial Intelligence. pp. 484–496. SCITEPRESS - Science and Technology Publications, Valletta, Malta (2020). https://doi.org/10.5220/0009169004840496 [105] Terras, M.: The Rise of Digitization. In: Rikowski, R. (ed.) Digi- tisation Perspectives, pp. 3–20. SensePublishers, Rotterdam (2011). https://doi.org/10.1007/978-94-6091-299-3 1 [106] Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In: Pro- ceedings of the seventh conference on Natural language learning at HLT- NAACL 2003-Volume 4. pp. 142–147. Association for Computational Lin- guistics (2003) [107] Todorov, K., Colavizza, G.: Transfer Learning for Named Entity Recog- nition in Historical Corpora. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS (2020) [108] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017), http://arxiv.org/abs/1706.03762 [109] Vilain, M., Su, J., Lubar, S.: Entity Extraction is a Boring Solved Problem: Or is It? In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguis- tics; Companion Volume, Short Papers. pp. 181–184. NAACL-Short ’07, Association for Computational Linguistics (2007), http://dl.acm.org/ citation.cfm?id=1614108.1614154, event-place: Rochester, New York [110] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cis- tac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.M.: HuggingFace’s Transformers: State- of-the-art Natural Language Processing. arXiv e-prints arXiv:1910.03771 (Oct 2019)