    Hybrid Statistical and Attentive Deep Neural
     Approach for Named Entity Recognition in
               Historical Newspapers

                         Ghaith Dekhili and Fatiha Sadat

     University of Quebec in Montreal, 201 President Kennedy avenue, H2X 3Y7
                             Montreal, Quebec, Canada.
            dekhili.ghaith@courrier.uqam.ca, sadat.fatiha@uqam.ca



       Abstract. Neural network-based models have proved their efficiency on
       Named Entity Recognition (NER), one of the well-known NLP tasks. Be-
       sides, the attention mechanism has become an integral part of compelling
       sequence modeling and transduction models on various tasks. This tech-
       nique builds a context representation for a sequence by taking neighboring
       words into consideration.
       In this study, we propose an architecture that involves BiLSTM layers
       combined with a CRF layer and an attention layer in between. This
       was augmented with pre-trained contextualized word embeddings and
       dropout layers. Moreover, apart from using word representations, we use
       character-based representations, extracted by CNN layers, to capture
       morphological and orthographic information.
       Our experiments show an improvement in the overall performance: our
       attentive neural model augmented with contextualized word embeddings
       gives higher scores than our baselines.
       To the best of our knowledge, there is no study which combines the
       application of the attention mechanism and contextualized word embeddings
       for NER on historical newspapers.

       Keywords: Deep Neural Networks · Attention Mechanism · Contextu-
       alized Word Embeddings · Character Embeddings


1    Introduction
This work is done as part of the HIPE (Identifying Historical People, Places
and other Entities) shared task, “organised as a CLEF 2020 evaluation Lab and
dedicated to the evaluation of named entity processing on historical newspapers
in French, German and English” [11]. The shared task is organized as part of
“impresso Media Monitoring of the Past”, a project focused on information
extraction in historical newspapers. 1
  Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
  mons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September
  2020, Thessaloniki, Greece.
1
  https://impresso-project.ch/
    Named Entity Recognition and Classification (NERC) is a sub-task of in-
formation extraction and Natural Language Processing (NLP). It consists of
identifying certain textual objects such as names of persons, organizations and
places.
    Early NER systems were based on handcrafted rules, lexicons, orthographic
features and external knowledge resources. This was followed by “feature engi-
neering based NER systems” and machine learning [25]. Starting with [7], neural
network-based systems with a minimum of feature engineering have become
popular. Such models are interesting because they typically do not need the
domain-specific resources of earlier systems and are thus more domain
independent. Many neural architectures have been introduced, most of
them based on some form of Recurrent Neural Networks (RNN) over characters,
sub-words and/or word embeddings [30].
    Knowledge-based NER systems do not need labeled data, as they rely on
lexical resources and domain-specific knowledge. These systems perform well in
cases where the lexicon is exhaustive, but fail when the information does not
exist in the domain dictionaries [30]. A second drawback of these systems is that
they require domain experts to construct and maintain the knowledge resources.
Finally, these systems can only be used for the domains and languages for which
they were designed, because of the specific features they encode [12].
    Supervised machine learning models learn how to make predictions during
training on couples of inputs and their expected outputs, and can be used in
place of handcrafted rules [30].
    The NER task becomes more challenging when applied to historical and cultural
heritage collections. On the one hand, inputs can be extremely noisy, with errors
which differ from those found in tweet misspellings or speech transcription hesita-
tions [22, 5, 28]. On the other hand, the language is mostly of an earlier stage,
“which renders usual external and internal evidences less effective (e.g.,
the usage of different naming conventions and presence of historical spelling vari-
ations)” [4, 3]. Finally, “archives and texts from the past are not as anglophone
as in today’s information society, making multilingual resources and processing
capacities even more essential” [26, 11]. In this context, the objective of the CLEF
HIPE 2020 shared task is threefold:
    strengthening the robustness of existing approaches on non-standard in-
    puts; enabling performance comparison of NE processing on historical
    texts; and in the long run, fostering efficient semantic indexing of his-
    torical documents in order to support scholarship on digital cultural
    heritage collections [11].


2   Related Work
The main NER approaches are based on computational linguistics and machine
learning. [13] proposed ProMiner, a system based on a dictionary of synonyms that
identifies gene and protein mentions in text and links them to their corresponding
ids in the dictionary. [27] also presented a dictionary-based approach, for NER
in the medical domain. There are other well-known rule-based NER systems
such as LaSIE-II [15], NetOwl [17] and FASTUS [1]. These systems mainly
rely on semantic and syntactic rules to recognize entities [20].
    Among the applied machine learning techniques, we can cite Hidden Markov
Models (HMM), Maximum Entropy, decision trees, Support Vector Machines (SVM)
and Conditional Random Fields (CRF). [18] proposed a CRF model including
morphological features, Part-Of-Speech (POS) tags and word sequences. [16]
also used a CRF, and showed that using Word2Vec pre-trained word embeddings
improves the performance of NER models.
    On the other hand, neural network-based models have proved their efficiency
on NER tasks. Long Short-Term Memory (LSTM) [14] based neural networks
have been widely used in different NLP applications thanks to their ability to
capture long-term dependencies. These models show good results compared to
traditional approaches, even though they do not need dictionaries, gazetteers or
other additional information. [6] presented a hybrid model combining a Bidirectional
LSTM (BiLSTM) and a Convolutional Neural Network (CNN). [19] introduced
a neural model similar to [6], based on a BiLSTM combined with a CRF. [23] used
the attention mechanism to develop a model which takes advantage of sentence-level
and document-level hierarchical contextual representations. [8] introduced
BERT, acronym of Bidirectional Encoder Representations from Transformers,
“which is a language model designed for pre-training deep bidirectional repre-
sentations from unlabeled text by jointly conditioning on both left and right
context in all layers” [8]. The model obtained new state-of-the-art results on
eleven natural language processing tasks. For more details on related work we
refer the reader to [10].


3     Background on some Supervised ML Models

In this section we present some of the supervised machine learning models used
in this research: the LSTM and the BiLSTM, which combines two LSTMs,
followed by a brief description of the CRF model and its usefulness.


3.1   The Long Short-Term Memory model

LSTM is an RNN architecture used in the field of deep learning. “This power-
ful family of connectionist models can capture time dynamics via cycles in the
graph” [14, 24].
    RNNs take as input a sequence of vectors (x_1, x_2, ..., x_n) and return the
sequence of hidden state vectors (h_1, h_2, ..., h_n), which store the information
learned at the current and previous steps. “Although RNNs can, in theory, learn
long dependencies, in practice they fail to do so and tend to be biased towards
their most recent inputs in the sequence” [2]. Thanks to their memory cells,
LSTMs are able to resolve this issue by capturing useful information from previous
states [14]. An LSTM unit is updated at time t using the following equations:

                           i_t = σ(W_i h_{t-1} + U_i x_t + b_i)                 (1)
                           f_t = σ(W_f h_{t-1} + U_f x_t + b_f)                 (2)
                           c̃_t = tanh(W_c h_{t-1} + U_c x_t + b_c)             (3)
                           c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t                     (4)
                           o_t = σ(W_o h_{t-1} + U_o x_t + b_o)                 (5)
                           h_t = o_t ⊙ tanh(c_t)                                (6)
      where:

 • σ is the element-wise sigmoid function
 • ⊙ is the element-wise product
 • x_t is the input vector at time t
 • h_t is the hidden state (output) vector storing useful information at (and
   before) time t
 • U_i, U_f, U_c, U_o are the weight matrices of the different gates for input x_t
 • W_i, W_f, W_c, W_o denote the weight matrices for hidden state h_t
 • b_i, b_f, b_c, b_o are the bias vectors

      [14, 24].
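For concreteness, the following is a minimal NumPy sketch of one LSTM update
following equations (1)-(6); the toy dimensions, random initialization and helper
names are illustrative, not our actual settings.

```python
# A minimal NumPy sketch of one LSTM update following equations (1)-(6);
# dimensions, random initialization and the helper names are illustrative.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])      # input gate, (1)
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])      # forget gate, (2)
    c_tilde = np.tanh(W["c"] @ h_prev + U["c"] @ x_t + b["c"])  # candidate cell, (3)
    c_t = f_t * c_prev + i_t * c_tilde                          # cell state, (4)
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])      # output gate, (5)
    h_t = o_t * np.tanh(c_t)                                    # hidden state, (6)
    return h_t, c_t

rng = np.random.default_rng(0)
d_in, d_h = 4, 3                                                # toy dimensions
W = {g: rng.normal(size=(d_h, d_h)) for g in "ifco"}
U = {g: rng.normal(size=(d_h, d_in)) for g in "ifco"}
b = {g: np.zeros(d_h) for g in "ifco"}
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):                          # 5-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```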


3.2     Bidirectional LSTM
An LSTM computes a hidden state vector →h_t representative of the left context of
the sentence at every step t [19]. To take advantage of the information obtained by
processing the same sentence in reverse, [9] proposed the BiLSTM model. The
idea is to use another LSTM to generate a second hidden state vector ←h_t
representative of the right context of the sentence. Concatenating these two
vectors leads to a representation h_t = [→h_t; ←h_t] of the word in its general context.
The resulting representation is useful for numerous tagging applications [19].
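A minimal Keras sketch of this bidirectional encoding (the library used later in
this work); the sequence length and layer sizes below are illustrative assumptions.

```python
# A minimal Keras sketch of the BiLSTM encoder: forward and backward hidden
# states are concatenated per token (merge_mode="concat"); sequence length
# and layer sizes are illustrative assumptions.
import tensorflow as tf

seq_len, emb_dim, lstm_units = 50, 100, 64
inputs = tf.keras.Input(shape=(seq_len, emb_dim))
# return_sequences=True keeps one context vector h_t per token for tagging.
contexts = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(lstm_units, return_sequences=True),
    merge_mode="concat",
)(inputs)  # shape: (batch, seq_len, 2 * lstm_units)
```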


3.3     CRF

“A very simple but surprisingly effective tagging model is to use the h_t’s as fea-
tures to make independent tagging decisions for each output y_t” [21]. In sequence
labeling tasks, however, taking neighboring labels into consideration can be helpful
when analyzing a given input sentence. Just as grammar makes a noun more likely
to follow an adjective than a verb, in NER the label I-ORG cannot follow
I-PERS [24].
    Therefore, as in the research presented by [19], we model the label sequence
jointly using a CRF, instead of modeling the labels independently.
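To illustrate how joint modeling exploits neighboring labels, here is a minimal
NumPy sketch of Viterbi decoding over per-token emission scores plus a label
transition matrix; the scores, label set and the viterbi helper are illustrative
placeholders, not our trained CRF.

```python
# A minimal NumPy sketch of Viterbi decoding over jointly scored label
# sequences: per-token emission scores (e.g., BiLSTM outputs) plus a label
# transition matrix, so a transition such as I-PERS -> I-ORG can be forbidden.
# Scores here are random placeholders, not a trained CRF.
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (T, K) token-label scores; transitions: (K, K) from->to."""
    T, K = emissions.shape
    score = emissions[0].copy()               # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)     # best previous label per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):             # follow back-pointers
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
labels = ["O", "B-PERS", "I-PERS", "B-ORG", "I-ORG"]
emissions = rng.normal(size=(6, len(labels)))
transitions = rng.normal(size=(len(labels), len(labels)))
transitions[labels.index("I-PERS"), labels.index("I-ORG")] = -1e4  # forbid it
print([labels[i] for i in viterbi(emissions, transitions)])
```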
3.4   Extracting Character Features Using a CNN

CNN layers have become ubiquitous in many NLP tasks. As in [6], we use for
each word a convolution and a max layer to extract a per-character feature (and
optionally the character type), obtaining at the end a new character embedding
built from the resulting features. A special PADDING token is used on both
sides of words to keep sequence lengths equal. In the following, we give a brief
description of CNNs applied to text [31].




                  Fig. 1. Basic representation of CNN layers [31].




Input Layer An input sequence x with n elements, each represented by a
d-dimensional vector, can be represented as a feature map of dimensionality
d × n. The bottom of Figure 1 shows the input layer as a rectangle with multiple
columns [31].


Convolution Layer The convolution layer learns representations by sliding
w-grams over the input sequence (x_1, x_2, ..., x_n). We consider a vector c_i
∈ ℝ^{wd} as the concatenated embeddings of the w entries (x_{i-w+1}, ..., x_i), where w is
the filter width and 0 < i < n + w. We pad the embeddings of x_i, where i < 1
or i > n, with zeros. We then represent the w-gram (x_{i-w+1}, ..., x_i) by a new
vector p_i ∈ ℝ^d using the convolution weights W ∈ ℝ^{d×wd}:

                                 p_i = tanh(W c_i + b)                          (7)

where b ∈ ℝ^d is the bias [31].
Maxpooling We use the w-gram representations p_i (i = 1, ..., n+w−1) to generate a
representation of the input sequence x by applying maxpooling: x_j = max(p_{1,j}, p_{2,j}, ...)
where j = 1, ..., d [31].
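A minimal Keras sketch of this convolution-plus-maxpooling character encoder;
the sizes (filter width w = 3, 30 filters) are illustrative assumptions, while the
25-dimensional character embeddings match the setting described in Section 4.2.

```python
# A minimal Keras sketch of the character-level CNN: character embeddings of
# one word are convolved with filters of width w and max-pooled over positions
# into a fixed-size word vector. Sizes (w = 3, 30 filters) are illustrative.
import tensorflow as tf

max_word_len, n_chars, char_dim, n_filters, w = 20, 80, 25, 30, 3
char_ids = tf.keras.Input(shape=(max_word_len,), dtype="int32")
char_emb = tf.keras.layers.Embedding(n_chars, char_dim)(char_ids)
conv = tf.keras.layers.Conv1D(n_filters, w, padding="same",
                              activation="tanh")(char_emb)   # eq. (7) per position
pooled = tf.keras.layers.GlobalMaxPooling1D()(conv)          # max over positions
```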

3.5    Attention mechanism
“The attention mechanism has become an integral part of compelling sequence
modeling and transduction models in various tasks” [29]. This technique represents
the context in a sequence by taking neighboring words into consideration [29].
For more details on the attention mechanism we refer the reader to [29].
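As an illustration, here is a minimal NumPy sketch of the scaled dot-product
attention of [29], where each token's representation becomes a weighted
combination of its neighbors; the shapes are toy assumptions.

```python
# A minimal NumPy sketch of scaled dot-product attention [29]: each token's
# representation becomes a softmax-weighted combination of all tokens' values.
# Shapes are toy assumptions.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V                               # context-weighted values

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))                         # 7 tokens, 16-dim states
context = scaled_dot_product_attention(X, X, X)      # self-attention
```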


4     Our Proposed Approach
This study aims to evaluate the effectiveness of our proposed attentive neural
approach in recognizing and classifying NEs in historical newspapers, in
comparison with a statistical model augmented with orthographic features and
two other neural models.

4.1    Statistical approach
Statistical approaches based on CRFs, SVMs or the Perceptron have achieved good
performance using only handcrafted features in many NLP tasks such as NERC
[6]. In our work, we use the CRFsuite2 implementation of CRF provided by the HIPE
team. Among CRF implementations, CRFsuite is the fastest for training the
model and labeling data.
     In our baseline we use basic orthographic spelling features extracted from
words, such as the prefix and suffix, the casing of the initial character, and whether
the word is a digit.
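For illustration, a sketch of the kind of orthographic spelling features described
above, encoded as a CRFsuite-style feature dictionary; the exact feature set of the
HIPE baseline may differ, and the function name is ours.

```python
# An illustrative sketch of the kind of orthographic spelling features fed to
# the CRF baseline, as a CRFsuite-style feature dictionary; the exact feature
# set of the HIPE baseline may differ.
def orthographic_features(word):
    return {
        "prefix3": word[:3],                     # prefix
        "suffix3": word[-3:],                    # suffix
        "initial_upper": word[:1].isupper(),     # casing of initial character
        "is_digit": word.isdigit(),              # whether the token is a digit
        "lower": word.lower(),
    }

print(orthographic_features("Genève"))
```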

4.2    Neural network approach
In this section we present our NER neural model followed by a brief description
of used input embeddings and additional features.

Proposed NER Model As in [6, 19], our architecture uses BiLSTM layers
for the extraction of word-level features. These layers are followed by an attention
layer. We also use a CRF layer on top of our model, augmented with some
features such as dropout layers. Figure 2 presents our proposed architecture.
Apart from word representations, we also use character representations to
extract morphological and orthographic features. As shown in Figure 2, word
embeddings are given to a BiLSTM; l_i and r_i represent word i in its left and
right contexts respectively. The concatenation of these two vectors represents the
word’s context c_i [19].
2
    http://www.chokkan.org/software/crfsuite/
Fig. 2. The main architecture of our NER system using BiLSTM, attention and CRF
layers [6, 19].
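A minimal Keras sketch of this architecture under illustrative sizes (vocabulary,
sequence length and layer widths are assumptions); note that a dense softmax head
stands in here for the CRF layer, whose decoding is sketched in Section 3.3, so
this is an approximation of Figure 2, not our exact implementation.

```python
# A minimal Keras sketch of the architecture in Figure 2 under illustrative
# sizes (vocabulary, sequence length, layer widths are assumptions). A dense
# softmax head stands in for the CRF layer; CRF decoding is sketched in
# Section 3.3.
import tensorflow as tf

seq_len, max_word_len = 50, 20
n_words, n_chars, n_tags = 10000, 80, 11

word_ids = tf.keras.Input(shape=(seq_len,), dtype="int32")
char_ids = tf.keras.Input(shape=(seq_len, max_word_len), dtype="int32")

# Word embeddings (initialized from pre-trained vectors in practice) and
# per-word character features extracted by the char-CNN of Section 3.4.
word_emb = tf.keras.layers.Embedding(n_words, 100)(word_ids)
char_emb = tf.keras.layers.Embedding(n_chars, 25)(char_ids)
char_cnn = tf.keras.Sequential([
    tf.keras.layers.Conv1D(30, 3, padding="same", activation="tanh"),
    tf.keras.layers.GlobalMaxPooling1D(),
])
char_feats = tf.keras.layers.TimeDistributed(char_cnn)(char_emb)

x = tf.keras.layers.Concatenate()([word_emb, char_feats])
x = tf.keras.layers.Dropout(0.5)(x)                   # dropout on input features
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(100, return_sequences=True))(x)
x = tf.keras.layers.Attention()([x, x])               # self-attention layer
x = tf.keras.layers.Dropout(0.5)(x)
tag_scores = tf.keras.layers.Dense(n_tags, activation="softmax")(x)

model = tf.keras.Model([word_ids, char_ids], tag_scores)
model.compile(optimizer="nadam", loss="sparse_categorical_crossentropy")
```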




Input Embeddings The input layers of our model are vector representations
of words. “Learning independent representations for word types from the limited
NER training data is a difficult problem: there are simply too many parameters
to reliably estimate” [19]. In our study, we use pre-trained contextualized word
embeddings to initialize our look-up table and to enrich our training dataset.
In our experiments, we use the in-domain Flair embeddings provided by the HIPE
organizers. “These embeddings were computed with a context of 250 characters, 1
hidden layer of size 2048, and a dropout of 0.1. Input was normalized with low-
ercasing, replacement of digits by 0, everything else was kept as in the original
text” [11]. Extracting character-level representations allows us to take advantage
of features related to the domain at hand. Following [6], we use a CNN layer to
represent each word based on its characters. We initialize a look-up table ran-
domly with values between −0.5 and 0.5 to generate character representations
of 25 dimensions. The character set is formed by all characters present in the
dataset, plus special PADDING and UNKNOWN tokens, used for CNN padding
and for all other characters, respectively. Figure 3 [6] presents an example where
we feed the character embeddings of the word “Picasso” to a CNN.
Fig. 3. An architecture using character embeddings of the word “Picasso” in CNN [6].
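A small sketch of this randomly initialized character look-up table; the reduced
character set and the seed are illustrative assumptions.

```python
# A small sketch of the randomly initialized character look-up table described
# above: values drawn uniformly in [-0.5, 0.5], 25 dimensions per character;
# the reduced character set and seed are illustrative.
import numpy as np

rng = np.random.default_rng(0)
charset = sorted(set("abcdefghijklmnopqrstuvwxyz"))  # dataset characters (toy)
char2id = {c: i for i, c in enumerate(charset + ["PADDING", "UNKNOWN"])}
char_table = rng.uniform(-0.5, 0.5, size=(len(char2id), 25))
# Embed the characters of "picasso", falling back to UNKNOWN when needed.
picasso = char_table[[char2id.get(c, char2id["UNKNOWN"]) for c in "picasso"]]
```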


Additional features As information related to capitalization has been removed
during the construction of the word embedding map, we use a separate look-up
table to add this feature, with the following options: allCaps (the word is in capital
letters), upperInitial (only the first letter is capitalized), lowercase (the word
is lower-cased) and mixedCaps (capital and small letters are mixed) [7, 6]. In
our work, we also use additional character-based features: a look-up table
generates a vector which represents the character’s type (uppercase,
lowercase, punctuation or other).
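A small sketch of this separate casing feature; the four categories follow [7, 6],
while the function name and index encoding are our assumptions.

```python
# A small sketch of the separate casing feature; the four categories follow
# [7, 6], while the function name and index encoding are our assumptions.
def casing_feature(word):
    if word.isupper():
        return "allCaps"        # the word is in capital letters
    if word[:1].isupper() and word[1:].islower():
        return "upperInitial"   # only the first letter is capitalized
    if word.islower():
        return "lowercase"      # the word is lower-cased
    return "mixedCaps"          # capital and small letters are mixed

CASING_INDEX = {c: i for i, c in enumerate(
    ["allCaps", "upperInitial", "lowercase", "mixedCaps"])}
print(casing_feature("Picasso"), CASING_INDEX[casing_feature("Picasso")])
```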


5     Experiments and Evaluations

In this section we present our experiments and the results obtained with different
models.


5.1   Task Description

The CLEF HIPE 2020 shared task includes two NE processing tasks with sub-
tasks of different levels of difficulty. In our work we participate in the coarse-
grained sub-task of the NERC task, which includes the recognition and clas-
sification of entity mentions according to high-level entity types. In our case the
types used for annotation are LOC, ORG, PERS, PROD and TIME.


5.2   Training data

“The shared task corpus is composed of digitized and OCRed articles originating
from Swiss, Luxembourgish and American historical newspaper collections and
selected on a diachronic basis” [11].3 Table 1 shows an overview of the French
corpus statistics.


                  Datasets  #docs  #tokens  #mentions  %noisy
                  Train       158  129,925      7,885       -
                  Dev          43   29,571      1,938       -
                  Test         43   32,035      1,802   12.15
                  All         244  191,531     11,625       -

                  Table 1. Overview of French corpus statistics [11].



5.3     Training and implementation details

As in [6], we use in our experiments the IOB tagging scheme, which stands for
Inside, Outside and Begin. This scheme allows us to mark the position of each
word within a named entity. We implement our model using the Keras library
with TensorFlow as a backend. As in [6], we initialize LSTM states with zero vec-
tors. Except for the character and word embeddings, whose initializations have
been described previously, we initialize all look-up tables randomly. We train our
model with mini-batches using the Nadam optimization algorithm. As in [19], we use
a single layer for both forward and backward LSTMs, and we apply dropout layers
to make our model learn from both word and character features. Applying
dropout was effective in reducing overfitting and improving our model’s
performance.
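As a toy illustration of IOB-style tags (here with B- marking the start of each
mention); the example sentence is ours, not taken from the corpus.

```python
# A toy illustration of IOB tags (B- marks the beginning of a mention,
# I- its continuation, O a token outside any mention); the sentence is ours.
tokens = ["Pablo", "Picasso", "est", "né", "à", "Malaga", "."]
tags   = ["B-PERS", "I-PERS", "O", "O", "O", "B-LOC", "O"]
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```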


5.4     Evaluation Measures

      NERC task in CLEF HIPE 2020 shared task is evaluated in terms of
      Precision, Recall and F-measure (F1). Evaluation is done at entity level
      according to two metrics: micro average, with the consideration of all
      TP, FP, and FN 4 over all documents, and macro average, with the
      average of document’s micro figures. NERC benefits from strict and fuzzy
      evaluation regimes. For NERC, the strict regime corresponds to exact
      boundary matching and the fuzzy to overlapping boundaries [11].

For more details on evaluation metrics we refer the reader to [11].
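For illustration, a minimal sketch of the two averaging modes with toy counts;
micro averaging pools TP/FP/FN over all documents, while macro averaging
averages each document's micro figures.

```python
# A minimal sketch of the two averaging modes with toy counts: micro pools
# TP/FP/FN over all documents, macro averages each document's micro figures.
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

docs = [(8, 2, 4), (3, 1, 1)]                      # (TP, FP, FN) per document
micro = prf(*map(sum, zip(*docs)))                 # pool counts, then score
macro = tuple(sum(v) / len(docs)                   # average per-document scores
              for v in zip(*(prf(*d) for d in docs)))
print("micro:", micro, "macro:", macro)
```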
3
  From the Swiss National Library, the Luxembourgish National Library, and the
  Library of Congress (Chronicling America project), respectively. Original collections
  correspond to 4 Swiss and Luxembourgish titles, and a dozen for English.
4
  True positive, False positive, False negative
5.5   Models Evaluation
In this section we evaluate the different models described above. Table 2
summarizes their main differences.


  Models      Statistical  Orth. features  Neural  Cont. WE  Att. mech.
  Model 1          ✓              ✓           ✗         ✗          ✗
  Model 2          ✗              ✗           ✓         ✗          ✗
  Model 3          ✗              ✗           ✓         ✗          ✓
  Our model        ✗              ✗           ✓         ✓          ✓

                        Table 2. The different models studied.




Results Tables 3, 4, 5 and 6 show a comparison of results obtained with different
studied models.

Discussion According to the results presented in Tables 3 and 4, we notice on
the one hand that the use of in-domain contextualized word embeddings and the
attention mechanism leads to a higher F-measure for LOC, PROD and TIME
entities compared to all other models in both fuzzy and strict regimes. On the
other hand, the statistical model augmented with orthographic features performs
better on the ORG and PERS entities. This could be explained by the impor-
tance of the syntactic information provided by these features and the large portion
of information that they encode, which is essential for the NERC task.
    Considering the metonymic sense, according to Table 5, all neural models
perform better than the statistical model augmented with orthographic features
in both regimes, and our model has higher scores than the other neural mod-
els, except model 3, which has a higher F-measure in the strict regime. Moreover,
even if other models have higher precision, our model shows higher recall, which
leads to a higher F-measure. We are convinced that “actively tackling the
problem of OCR noise and hyphenation issues helps to achieve better recall” [11].
These results show that neural models, especially our proposed model using
contextualized word embeddings and the attention mechanism, perform far bet-
ter than the statistical model on all entities when the metonymic sense is considered.
    Considering Table 6, we notice that models 2 and 3 perform better than the
statistical model, and slightly better than our proposed model on the ORG entity,
which shows that these models were better able to generalize on the test data
at this stage.
    All these improvements prove the efficiency of our neural model architec-
ture and of the different features used in training, especially the contextualized
word embeddings trained on large quantities of raw data and the character
embeddings extracted from the domain-specific dataset. Our neural model is thus
able to extract the necessary knowledge from training data without using
handcrafted features.
    An important aspect of the CLEF HIPE 2020 shared task corpus, and of
historical newspaper data in general, is the noise generated by OCR. As reported
in [11], noisy mentions remarkably affect a model’s performance: “little noise
as 0.1 severely hurts the system’s ability to predict an entity and may cut its
performance by half” [11]. In our study, we do not report results obtained on
the dev set since, in the final step, after using the dev set to fine-tune our model’s
parameters, we used both the train and dev sets for training. However, we con-
firm the degradation of our model’s performance, caused in part by the fact
that “11 % of all mentions in the test set contain OCR mistakes” [11].
   Table 3. Our models results for NERC-Coarse in French, considering literal sense of entities (micro average).

                        Model 1                                      Model 2                                      Model 3                                     Our model
               Fuzzy                 Strict                 Fuzzy                 Strict                 Fuzzy                 Strict                 Fuzzy                 Strict

Label    P      R      F1      P      R       F1      P      R      F1      P      R       F1      P      R      F1      P       R      F1      P      R       F1     P       R       F1

 LOC    .853   .797    .824   .778   .727     .752   .808   .879    .842   .728   .793     .759   .777   .833    .804   .704   .754     .728   .871   .837    .854   .798    .767    .782
ORG     .596   .454    .515   .586   .446     .507   .482   .408    .442   .427   .362     .392    .5    .438    .467    .43   .377     .402   .488   .454     .47   .405    .377     .39
PERS    .851   .753    .799   .624   .552     .586   .722   .681    .701   .553   .522     .537   .722   .707    .714    .53    .52     .525   .752   .622     .68   .484     .4     .438
PROD    .565   .426    .486   .457   .344     .393    .55   .361    .436   .525   .344     .416   .524   .361    .427   .476   .328     .388   .617   .475    .537   .553    .426    .481
TIME    .805   .623    .702   .488   .377     .426   .865   .849    .857   .519   .509     .514   .833   .755    .792   .542   .491     .515   .841   .698    .763   .568    .472    .515
 ALL    .824   .736    .777   .698   .623     .659   .755   .758    .757   .644   .646     .645   .736   .741    .738   .621   .625     .623   .796    .72    .756    .66    .598    .627




   Table 4. Our models results for NERC-Coarse in French, considering literal sense of entities (macro average).

                        Model 1                                      Model 2                                      Model 3                                     Our model
               Fuzzy                 Strict                 Fuzzy                 Strict                 Fuzzy                 Strict                 Fuzzy                 Strict

Label    P      R       F1     P       R       F1     P      R       F1     P       R       F1     P      R      F1      P       R      F1      P      R      F1      P      R       F1

 LOC    .843   .789    .815   .776     .72    .748   .799   .864    .821   .723    .782    .742   .777   .828    .787   .708   .757     .717   .847   .828    .823   .775   .757     .752
ORG     .598   .501    .629   .565    .468    .589   .455   .371    .457   .382    .301    .379   .343   .353     .38   .298   .288     .325   .444   .511    .526   .335   .375     .393
PERS    .849   .745    .809   .622    .551    .596   .745   .704     .74   .597    .568    .595   .717     .7    .706   .563    .55     .554    .75   .609     .68   .502   .414     .462
PROD    .582   .564    .621    .47    .477    .519   .518   .352     .43   .514    .349    .426   .411   .255    .362   .366   .228     .322   .546   .539    .677    .5    .489     .614
TIME    .848   .671    .836   .569    .458    .571   .883   .906    .928   .575    .586    .602    .85    .83    .868   .537   .548     .566    .87   .751    .884   .691   .582     .689
 ALL    .851   .734    .779    .72    .622    .661    .78   .767    .771   .672    .657    .662   .755   .745    .746   .651   .639     .641   .796   .712    .747   .669   .592     .624
Table 5. Our models results for NERC-Coarse in French, considering metonymic sense of entities (micro average).

                         Model 1                                       Model 2                                  Model 3                           Our model
                Fuzzy                Strict                Fuzzy                  Strict                Fuzzy             Strict             Fuzzy         Strict
  Label     P       R    F1      P       R     F1      P       R       F1     P       R       F1    P    R      F1    P    R       F1    P    R    F1   P    R      F1
  ORG   .625 .18  .28   .625 .18  .28   .494 .351 .411  .494 .351 .411  .565 .351 .433  .565 .351 .433  .468 .468 .468  .423 .423 .423
  TIME    0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0
  ALL   .625 .179 .278  .625 .179 .278  .494 .348 .408  .494 .348 .408  .565 .348 .431  .565 .348 .431  .468 .464 .466  .423 .42  .422




Table 6. Our models results for NERC-Coarse in French, considering metonymic sense of entities (macro average).

                            Model 1                                     Model 2                                  Model 3                         Our model
                    Fuzzy                Strict                Fuzzy                  Strict             Fuzzy             Strict            Fuzzy       Strict
    Label       P    R      F1       P     R      F1       P       R     F1       P       R    F1   P     R      F1   P      R      F1   P    R   F1    P   R     F1
   ORG   .565 .278 .448  .565 .278 .448  .362 .42  .551  .362 .42  .551  .477 .437 .592  .477 .437 .492  .34  .36  .458  .32  .342 .431
   TIME    0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0     0    0    0
   ALL   .565 .278 .448  .565 .278 .448  .362 .42  .551  .362 .42  .551  .477 .436 .592  .477 .436 .592  .34  .36  .457  .32  .341 .431
6   Conclusion

In this paper, we presented a hybrid approach for NERC applied to historical
newspapers. In our experiments, we used orthographic features related to word
syntax. Besides, we used word and character embeddings, which allow us to
detect morphological and orthographic features related to a specific domain. Our
experiments show an improvement in the overall performance: our attentive
neural model augmented with contextualized word embeddings performs better
overall compared to our baselines. To the best of our knowledge, there is no study
which combines the application of the attention mechanism and contextualized
word embeddings for NERC in the historical newspaper domain.
    As future work, we aim to investigate the usefulness of adding further
features to the hybrid architecture and the use of external resources such as
ontologies and other knowledge and common-sense bases. Applying multi-task
learning will also be part of our future work. Moreover, it would be relevant to
apply explainability techniques to the neural network models in order to better
explain and analyze the results.
                               Bibliography


 [1] D. E. Appelt, J. R. Hobbs, J. Bear, D. Israel, M. Kameyama, A. Kehler,
     D. Martin, K. Myers, and M. Tyson. SRI International FASTUS system:
     MUC-6 test results and analysis. In Sixth Message Understanding
     Conference (MUC-6): Proceedings of a Conference Held in Columbia, Mary-
     land, November 6-8, 1995, pages 237–248, 1995.
 [2] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies
     with gradient descent is difficult. IEEE Transactions on Neural Networks,
     5(2):157–166, 1994.
 [3] M. Bollmann. A large-scale comparison of historical text normalization sys-
     tems. In Proceedings of the 2019 Conference of the North American Chapter
     of the Association for Computational Linguistics: Human Language Tech-
     nologies, Volume 1 (Long and Short Papers). Association for Computational
     Linguistics, 2019.
 [4] L. Borin, D. Kokkinakis, and L.-J. Olsson. Naming the past: Named entity
     and Animacy recognition in 19th century Swedish literature. In Proceedings
     of the Workshop on Language Technology for Cultural Heritage Data (LaT-
     eCH 2007)., pages 1–8. Association for Computational Linguistics, 2007.
 [5] G. Chiron, A. Doucet, M. Coustaty, M. Visani, and J. Moreux. Impact of ocr
     errors on the use of digital libraries: Towards a better access to information.
     In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages
     249–252. IEEE Press, 2017.
 [6] J. P. C. Chiu and E. Nichols. Named entity recognition with bidirectional
     lstm-cnns. Transactions of the Association for Computational Linguistics,
     4(1):357–370, 2015.
 [7] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and
     P. Kuksa. Natural language processing (almost) from scratch. Journal
     of Machine Learning Research, 12(1):2493–2537, 2011.
 [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training
     of deep bidirectional transformers for language understanding. CoRR,
     abs/1810.04805, 2018.
 [9] C. Dyer, M. Ballesteros, W. Ling, A. Matthews, and N. A. Smith.
     Transition-based dependency parsing with stack long short-term memory.
     In Proceedings of the 53rd Annual Meeting of the Association for Computa-
     tional Linguistics and the 7th International Joint Conference on Natural
     Language Processing (Volume 1: Long Papers), pages 334–343, 2015.
[10] M. Ehrmann, M. Romanello, A. Flückiger, and S. Clematide. Extended
     Overview of CLEF HIPE 2020: Named Entity Processing on Historical
     Newspapers. In L. Cappellato, C. Eickhoff, N. Ferro, and A. Névéol, editors,
     CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and
     Labs of the Evaluation Forum. CEUR-WS, 2020.
[11] M. Ehrmann, M. Romanello, A. Flückiger, and S. Clematide. Overview
     of CLEF HIPE 2020: Named Entity Recognition and Linking on Histori-
     cal Newspapers. In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis,
     H. Joho, C. Lioma, C. Eickhoff, A. Névéol, L. Cappellato, and N. Ferro,
     editors, Experimental IR Meets Multilinguality, Multimodality, and Inter-
     action. Proceedings of the 11th International Conference of the CLEF Asso-
     ciation (CLEF 2020), volume 12260 of Lecture Notes in Computer Science
     (LNCS). Springer, 2020.
[12] A. Goyal, V. Gupta, and M. Kumar. Recent named entity recognition and
     classification techniques: A systematic review. Computer Science Review,
     29(1):21–43, 2018.
[13] D. Hanisch, K. Fundel, H.-T. Mevissen, R. Zimmer, and J. Fluck. ProMiner:
     rule-based protein and gene entity recognition. BMC Bioinformatics,
     6(1):S14, 2005.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Com-
     putation, 9(8):1735–1780, 1997.
[15] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cun-
     ningham, and Y. Wilks. University of Sheffield: Description of the LaSIE-II
     system as used for MUC-7. In Seventh Message Understanding Conference
     (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29
     - May 1, 1998. Morgan, 1998.
[16] M. Joshi, E. Hart, M. Vogel, and J.-D. Ruvini. Distributed word represen-
     tations improve NER for e-commerce. In Proceedings of the 1st Workshop
     on Vector Space Modeling for Natural Language Processing, pages 160–167,
     Colorado, 2015. Association for Computational Linguistics.
[17] G. R. Krupka and K. Hausman. IsoQuest Inc.: Description of the NetOwl™
     extractor system as used for MUC-7. In Seventh Message Understanding
     Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Vir-
     ginia, April 29 - May 1, 1998, pages 21–28, 1998.
[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random
     fields: Probabilistic models for segmenting and labeling sequence data. In
     Proceedings of the Eighteenth International Conference on Machine Learn-
     ing, pages 282–289. Morgan Kaufmann Publishers Inc., 2001.
[19] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer.
     Neural architectures for named entity recognition. In Proceedings of
     NAACL-HLT 2016, page 260–270, 2016.
[20] J. Li, A. Sun, J. Han, and C. Li. A survey on deep learning for named entity
     recognition. CoRR, 2018.
[21] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black,
     and I. Trancoso. Finding function in form: Compositional character models
     for open vocabulary word representation. In Proceedings of the Conference
     on Empirical Methods in Natural Language Processing (EMNLP), 2015.
[22] E. Linhares Pontes, A. Hamdi, N. Sidere, and A. Doucet. Impact of ocr
     quality on named entity linking. In A. Jatowt, A. Maeda, and S. Y. Syn,
     editors, Digital Libraries at the Crossroads of Digital Information for the
     Future, pages 102–115. Springer International Publishing, 2019.
[23] Y. Luo, F. Xiao, and H. Zhao. Hierarchical contextualized representation
     for named entity recognition. CoRR, abs/1911.02257, 2019.
[24] X. Ma and E. Hovy. End-to-end sequence labeling via bi-directional LSTM-
     CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association
     for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074.
     Association for Computational Linguistics, 2016.
[25] D. Nadeau and S. Sekine. A survey of named entity recognition and classi-
     fication. Lingvisticae Investigationes, 30(1):3–26, 2007.
[26] C. Neudecker and A. Antonacopoulos. Making Europe’s historical newspa-
     pers searchable. In 2016 12th IAPR Workshop on Document Analysis Systems
     (DAS), pages 405–410, 2016.
[27] A. P. Quimbaya, A. S. Múnera, R. A. G. Rivera, J. C. D. Rodríguez,
     O. M. M. Velandia, A. A. G. Peña, and C. Labbé. Named entity recog-
     nition over electronic health records through a combined dictionary-based
     approach. Procedia Computer Science, 100(1):55–61, 2016.
[28] D. A. Smith and R. Cordell. A Research Agenda for Historical and Multi-
     lingual Optical Character Recognition. Tech. rep. 2018.
[29] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
     L. Kaiser, and I. Polosukhin.        Attention is all you need.    CoRR,
     abs/1706.03762, 2017.
[30] V. Yadav and S. Bethard. A survey on recent advances in named entity
     recognition from deep learning models. In Proceedings of the 27th Interna-
     tional Conference on Computational Linguistics, pages 2145–2158. Associ-
     ation for Computational Linguistics, 2018.
[31] W. Yin, K. Kann, M. Yu, and H. Schütze. Comparative study of CNN and
     RNN for natural language processing. CoRR, abs/1702.01923, 2017.