<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2020 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hybrid Statistical and Attentive Deep Neural Approach for Named Entity Recognition in Historical Newspapers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ghaith Dekhili</string-name>
          <email>dekhili.ghaith@courrier.uqam.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatiha Sadat</string-name>
          <email>sadat.fatiha@uqam.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Quebec in Montreal</institution>
          ,
          <addr-line>201 President Kennedy avenue, H2X 3Y7 Montreal, Quebec</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <volume>4</volume>
      <issue>4</issue>
      <abstract>
        <p>Neural network based models have proved their efficiency on Named Entity Recognition (NER), one of the well-known NLP tasks. Besides, the attention mechanism has become an integral part of compelling sequence modeling and transduction models on various tasks. This technique allows the representation of context in a sequence by taking neighboring words into consideration. In this study, we propose an architecture that involves BiLSTM layers combined with a CRF layer and an attention layer in between. This was augmented with pre-trained contextualized word embeddings and dropout layers. Moreover, apart from using word representations, we use character-based representations, extracted by CNN layers, to capture morphological and orthographic information. Our experiments show an improvement in the overall performance. We notice that our attentive neural model augmented with contextualized word embeddings gives higher scores compared to our baselines. To the best of our knowledge, no previous study combines the application of the attention mechanism and contextualized word embeddings for NER on historical newspapers.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Contextualized Word Embeddings</kwd>
        <kwd>Character Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This work is done as part of the HIPE (Identifying Historical People, Places
and other Entities) shared task, "organised as a CLEF 2020 evaluation Lab and
dedicated to the evaluation of named entity processing on historical newspapers
in French, German and English" [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The shared task is organized as part of
"impresso - Media Monitoring of the Past" (https://impresso-project.ch/), a project
focused on information extraction in historical newspapers.
      </p>
      <p>Named Entity Recognition and Classification (NERC) is a sub-task of
information extraction and Natural Language Processing (NLP). It consists in
identifying certain textual objects such as names of persons, organizations and
places.</p>
      <p>
        Early NER systems were based on handcrafted rules, lexicons, orthographic
features and external knowledge resources. This was followed by "feature
engineering based NER systems" and machine learning [
        <xref ref-type="bibr" rid="ref12">25</xref>
        ]. Starting with [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], neural
network based systems with a minimum of feature engineering have become
popular. Such models are interesting because they typically do not need the
domain-specific resources of earlier systems, and are thus more
domain independent. Many neural architectures have been introduced, most of
them based on some form of Recurrent Neural Networks (RNN) over characters,
sub-words and/or word embeddings [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ].
      </p>
      <p>
        Knowledge-based NER systems do not need labeled data, as they rely on
lexicon resources and domain-specific knowledge. These systems perform well in
cases where the lexicon is exhaustive, but fail when the information does not
exist in the domain dictionaries [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ]. A second drawback of these systems is that
they require domain experts to construct and maintain the knowledge resources.
Finally, these systems can be used only on the domains and languages for which they
were designed, because of the specific features they had learned during training
[12].
      </p>
      <p>
        Supervised machine learning models learn how to make predictions during
training on pairs of inputs and their expected outputs, and can be used in
place of handcrafted rules [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ].
      </p>
      <p>
        The NER task becomes more challenging when applied to historical and cultural
heritage collections. On the one hand, inputs can be extremely noisy, with errors
which differ from the ones in tweet misspellings or speech transcription
hesitations [
        <xref ref-type="bibr" rid="ref15 ref5">22, 5, 28</xref>
        ]. On the other hand, the language is mostly of an earlier
stage, "which renders usual external and internal evidences less effective (e.g.,
the usage of different naming conventions and presence of historical spelling
variations)" [
        <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
        ]. Finally, "archives and texts from the past are not as anglophone
as in today's information society, making multilingual resources and processing
capacities even more essential" [
        <xref ref-type="bibr" rid="ref11 ref13">26, 11</xref>
        ]. In this context, the objective of the CLEF
HIPE 2020 shared task is threefold:
strengthening the robustness of existing approaches on non-standard
inputs; enabling performance comparison of NE processing on historical
texts; and in the long run, fostering efficient semantic indexing of
historical documents in order to support scholarship on digital cultural
heritage collections [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The main NER approaches are based on computational linguistics and machine
learning. [13] proposed ProMiner, which is based on a dictionary of synonyms to
identify gene and protein mentions in text and link them to their corresponding
ids in the dictionary. [
        <xref ref-type="bibr" rid="ref14">27</xref>
        ] presented an approach based on dictionaries as well,
for NER in the medical domain. There are other well-known rule-based NER
systems such as LaSIE-II [15], NetOwl [17] and FASTUS [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These systems are
mainly based on semantic and syntactic rules to recognize entities [20].
      </p>
      <p>Among applied machine learning techniques, we quote Hidden Markov Models
(HMM), Maximum Entropy, decision trees, Support Vector Machines (SVM)
and Conditional Random Fields (CRF). [18] proposed a CRF model including
morphological features, Part-Of-Speech (POS) tags, and word sequences. [16]
used a CRF too, and showed that using Word2Vec pre-trained word embeddings
improves the performance of NER models.</p>
      <p>
        On the other hand, neural network based models have proved their efficiency
on NER tasks. Long Short-Term Memory (LSTM) [14] based neural networks
have been widely used in different NLP applications thanks to their ability to
detect long-term dependencies. These models showed good results compared to
traditional approaches, even though they do not need dictionaries, gazetteers or other
additional information. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presented a hybrid model combining a Bidirectional
LSTM (BiLSTM) and a Convolutional Neural Network (CNN). [19] introduced
a neural model similar to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], based on a BiLSTM combined with a CRF. [23] used
the attention mechanism to develop a model which takes advantage of sentence-level
and document-level hierarchical contextual representations. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced
BERT, acronym of Bidirectional Encoder Representations from Transformers,
"which is a language model designed for pre-training deep bidirectional
representations from unlabeled text by jointly conditioning on both left and right
context in all layers" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The model obtained new state-of-the-art results on
eleven natural language processing tasks. For more details on related work we
refer the reader to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background on some Supervised ML Models</title>
      <p>In this section we present the supervised machine learning models used in this
research: the LSTM model, the BiLSTM model, which combines two LSTMs, and
a brief description of the CRF model and its usefulness.</p>
      <sec id="sec-3-1">
        <title>The Long Short-Term Memory model</title>
        <p>LSTM is an RNN architecture used in the field of deep learning. "This
powerful family of connectionist models can capture time dynamics via cycles in the
graph" [14, 24].</p>
        <p>
          RNNs take as input a sequence of vectors (x_1, x_2, ..., x_n) and return the
sequence of hidden state vectors (h_1, h_2, ..., h_n), which store the information
learned at the current and previous steps. "Although RNNs can, in theory, learn long
dependencies, in practice they fail to do so and tend to be biased towards their most
recent inputs in the sequence" [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Thanks to their memory cells, LSTMs are able
to resolve this point by capturing useful information from previous states [14]. An
LSTM unit is updated at a time t using the following equations [14, 24]:
        </p>
        <disp-formula>
          <tex-math><![CDATA[
\begin{aligned}
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) &\quad& (1)\\
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) && (2)\\
\tilde{c}_t &= \tanh(W_c h_{t-1} + U_c x_t + b_c) && (3)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && (4)\\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) && (5)\\
h_t &= o_t \odot \tanh(c_t) && (6)
\end{aligned}
]]></tex-math>
        </disp-formula>
        <p>where:
σ is the element-wise sigmoid function;
⊙ is the element-wise product;
x_t is the input vector at time t;
h_t is the hidden state (or output) vector storing useful information at (and before) time t;
U_i, U_f, U_c, U_o are the weight matrices of the different gates for input x_t;
W_i, W_f, W_c, W_o denote the weight matrices for hidden state h_t;
b_i, b_f, b_c, b_o are the bias vectors [14, 24].</p>
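        <p>To make the update rule concrete, the following is a minimal NumPy sketch of
equations (1)-(6) for a single LSTM step; the dimensions and random parameters are
illustrative only and are not those used in our model.</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the four gates, keyed 'i', 'f', 'c', 'o'
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])      # input gate, eq. (1)
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])      # forget gate, eq. (2)
    c_tilde = np.tanh(W['c'] @ h_prev + U['c'] @ x_t + b['c'])  # candidate cell, eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state, eq. (4)
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])      # output gate, eq. (5)
    h_t = o_t * np.tanh(c_t)                                    # new hidden state, eq. (6)
    return h_t, c_t

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_hid, d_hid)) for g in 'ifco'}
U = {g: rng.normal(size=(d_hid, d_in)) for g in 'ifco'}
b = {g: np.zeros(d_hid) for g in 'ifco'}
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):  # run over a 5-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
        </preformat>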
      </sec>
      <sec id="sec-3-2">
        <title>Bidirectional LSTM</title>
        <p>
          An LSTM computes a hidden state vector →h_t representing the left context of
the sentence at every step t [19]. To take advantage of the information that we could
get from treating the same sentence in reverse, [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] proposed the BiLSTM model.
The idea is to use a second LSTM to generate a second hidden state vector
←h_t representing the right context of the sentence. Concatenating these two
vectors yields a representation h_t = [→h_t ; ←h_t] of the word in its general context.
The resulting representation is useful for numerous tagging applications [19].
        </p>
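        <p>In Keras, which we use for our implementation (see Section 5.3), this
forward/backward concatenation can be obtained with the Bidirectional wrapper;
the sketch below uses illustrative sizes, not those of our model.</p>
        <preformat>
from keras.layers import Input, Bidirectional, LSTM
from keras.models import Model

# Each output step is the concatenation [->h_t ; <-h_t] of the forward
# and backward hidden states.
seq = Input(shape=(None, 100))     # sequence of 100-dim word vectors
h = Bidirectional(LSTM(64, return_sequences=True), merge_mode='concat')(seq)
model = Model(seq, h)              # output shape: (batch, time, 128)
        </preformat>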
      </sec>
      <sec id="sec-3-2b">
        <title>CRF</title>
        <p>"A very simple but surprisingly effective tagging model is to use the h_t's as
features to make independent tagging decisions for each output y_t" [21]. In sequence
labeling tasks, taking neighboring labels into consideration can help when
analyzing a given input sentence: just as, in some "grammar" rules, a noun is more
likely to follow an adjective than a verb, in NER the tag I-ORG cannot
follow I-PERS [24].</p>
        <p>Therefore, as in the research presented by [19], we model the label
sequence jointly using a CRF, instead of modeling labels independently.</p>
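        <p>As a sketch of the idea (not of the HIPE implementation), a linear-chain CRF
scores a candidate tag sequence by adding the per-token emission scores coming from
the network to learned transition scores between adjacent tags:</p>
        <preformat>
import numpy as np

def sequence_score(emissions, transitions, tags):
    # emissions:   (T, n_tags) per-token scores from the BiLSTM
    # transitions: (n_tags, n_tags) learned tag-to-tag scores
    # tags:        list of T tag indices; returns the unnormalized score
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

rng = np.random.default_rng(0)
s = sequence_score(rng.normal(size=(4, 5)), rng.normal(size=(5, 5)), [0, 2, 2, 1])
        </preformat>
        <p>Training maximizes the probability of the gold sequence, i.e. this score
normalized over all possible tag sequences; a constraint such as "I-ORG cannot
follow I-PERS" then corresponds to a very low learned transition score.</p>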
      </sec>
      <sec id="sec-3-3">
        <title>Extracting Character Features Using a CNN</title>
        <p>
          CNN layers have become ubiquitous in many NLP tasks. As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], for
each word we apply a convolution and a max layer to the per-character features
(optionally including the character type) to obtain at the end a new character-level
embedding with the resulting features. A special token "PADDING" is used on both
sides of words to keep sequences the same length. In this section we present
a brief description of CNNs applied to text [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        </p>
        <p>
          Input Layer An input sequence x with n elements, each one represented by a
d-dimensional vector, can be represented as a map of features of dimensionality
d × n. The bottom of Figure 1 shows the input layer as a rectangle with multiple
columns [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        <p>
          Convolution Layer The convolution layer learns representations by
sliding w-grams over the input sequence (x_1, x_2, ..., x_n). We consider a vector
c_i ∈ R^{wd} as the concatenated embeddings of w entries (x_{i-w+1}, ..., x_i), where w is
the filter width and 0 &lt; i &lt; s + w. We pad the embeddings of x_i, where i &lt; 1
or i &gt; n, with zeros. We then represent the w-gram (x_{i-w+1}, ..., x_i) by a new
vector p_i ∈ R^d using the convolution weights W ∈ R^{d×wd}:
        </p>
        <disp-formula>
          <tex-math><![CDATA[p_i = \tanh(W c_i + b) \qquad (7)]]></tex-math>
        </disp-formula>
        <p>
          where b ∈ R^d is the bias [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        <p>
          Maxpooling We use the w-gram representations p_i (i = 1, ..., s + w - 1) to generate the
representation of the input sequence x by applying max-pooling: x_j = max(p_{1,j}, p_{2,j}, ...),
where j = 1, ..., d [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
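        <p>A minimal Keras sketch of this character-level pipeline, using the 25-dimensional
character embeddings mentioned in Section 4.2; MAX_WORD_LEN, N_CHARS and the filter
settings are illustrative assumptions.</p>
        <preformat>
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D
from keras.models import Model

MAX_WORD_LEN, N_CHARS = 20, 100           # assumed padded word length, charset size
chars = Input(shape=(MAX_WORD_LEN,))
x = Embedding(N_CHARS, 25)(chars)         # 25-dim character embeddings
x = Conv1D(filters=30, kernel_size=3, padding='same', activation='tanh')(x)
word_vec = GlobalMaxPooling1D()(x)        # max over character positions
char_encoder = Model(chars, word_vec)     # one vector per word
        </preformat>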
      </sec>
      <sec id="sec-3-4">
        <title>Attention mechanism</title>
        <p>
          "The attention mechanism has become an integral part of compelling sequence
modeling and transduction models in various tasks" [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ]. This technique allows
the representation of the context in a sequence by taking neighboring words into
consideration [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ]. For more details on the attention mechanism we refer the reader to [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ].
        </p>
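        <p>As an illustration only, the following is a minimal NumPy sketch of the scaled
dot-product self-attention of [29], where each position is re-represented as a
weighted sum of all positions in the sequence.</p>
        <preformat>
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    # h: (T, d) hidden states; weights reflect pairwise similarity of positions
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores, axis=-1) @ h

h = np.random.default_rng(0).normal(size=(6, 8))   # 6 positions, 8 dims
context = self_attention(h)                        # shape (6, 8)
        </preformat>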
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Our Proposed Approach</title>
      <p>This study aims to evaluate the effectiveness of our proposed attentive neural
approach in recognizing and classifying NEs in historical newspapers, in
comparison with a statistical model augmented with orthographic features and
two other neural models.</p>
      <sec id="sec-4-1">
        <title>Statistical approach</title>
        <p>
          Statistical approaches based on CRF, SVM or Perceptron have achieved good
performance using only handcrafted features in many NLP tasks such as NERC
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In our work, we use the CRFsuite (http://www.chokkan.org/software/crfsuite/)
implementation of CRF provided by the HIPE team. Among CRF implementations,
CRFsuite is the fastest for training the model and labeling data.
        </p>
        <p>In our baseline we use basic orthographic spelling features extracted from
words, such as the prefix and suffix, the casing of the initial character, and whether
the token is a digit.</p>
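        <p>A minimal sketch of such orthographic features in the format accepted by the
python-crfsuite binding (one feature dictionary per token); the exact feature
templates of the provided baseline may differ.</p>
        <preformat>
import pycrfsuite

def word2features(sent, i):
    w = sent[i]
    return {
        "prefix3": w[:3],               # prefix
        "suffix3": w[-3:],              # suffix
        "init_upper": w[:1].isupper(),  # casing of the initial character
        "all_upper": w.isupper(),
        "is_digit": w.isdigit(),        # whether the token is a digit
    }

sent = ["Jean", "visite", "Genève", "en", "1898", "."]
xseq = [word2features(sent, i) for i in range(len(sent))]

trainer = pycrfsuite.Trainer(verbose=False)
# trainer.append(xseq, yseq)          # yseq: gold IOB tags for the sentence
# trainer.train("baseline.crfsuite")  # writes the model file
        </preformat>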
      </sec>
      <sec id="sec-4-2">
        <title>Neural network approach</title>
        <p>In this section we present our neural NER model, followed by a brief description
of the input embeddings and additional features used.</p>
        <p>
          Proposed NER Model As in [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ], we use in our architecture BiLSTM layers
for the extraction of word-level features. These layers are followed by an attention
layer. We also use a CRF layer on top of our model, augmented with
features such as dropout layers. Figure 2 presents our proposed architecture.
Apart from using word representations, we also use character representations to
extract morphological and orthographic features. As shown in Figure 2, word
embeddings are given to a BiLSTM. l_i and r_i represent the word i in its left and
right contexts respectively. The concatenation of these two vectors represents the
word's context c_i [19].
        </p>
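        <p>The following is a simplified Keras sketch of this architecture, showing only the
word-embedding path (the character CNN and casing inputs described below are
omitted for brevity). The CRF layer is assumed to come from the keras-contrib
package (an assumption; the paper does not name a CRF implementation), the
attention layer is written as a plain dot-product self-attention, and MAX_LEN,
VOCAB_SIZE, EMB_DIM, N_TAGS and the pretrained matrix are placeholders.</p>
        <preformat>
import numpy as np
from keras import backend as K
from keras.layers import (Input, Embedding, Bidirectional, LSTM, Dropout,
                          TimeDistributed, Dense, Lambda)
from keras.models import Model
from keras_contrib.layers import CRF   # assumption: keras-contrib CRF layer

MAX_LEN, VOCAB_SIZE, EMB_DIM, N_TAGS = 100, 20000, 300, 11
pretrained = np.zeros((VOCAB_SIZE, EMB_DIM))   # placeholder embedding matrix

def attend(h):
    # dot-product self-attention over the BiLSTM states
    scores = K.batch_dot(h, h, axes=[2, 2]) / K.sqrt(K.cast(K.shape(h)[-1], 'float32'))
    return K.batch_dot(K.softmax(scores), h)

words = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[pretrained])(words)
x = Dropout(0.5)(x)
x = Bidirectional(LSTM(100, return_sequences=True))(x)   # c_i = [l_i ; r_i]
x = Lambda(attend)(x)                                    # attention layer
x = Dropout(0.5)(x)
x = TimeDistributed(Dense(50, activation='relu'))(x)
crf = CRF(N_TAGS)
out = crf(x)
model = Model(words, out)
model.compile(optimizer='nadam', loss=crf.loss_function, metrics=[crf.accuracy])
        </preformat>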
        <p>
          Input Embeddings The input layers of our model are vector representations
of words. "Learning independent representations for word types from the limited
NER training data is a difficult problem: there are simply too many parameters
to reliably estimate" [19]. In our study, we use pre-trained contextualized word
embeddings to initialize our lookup table and to enrich our training dataset.
In our experiments, we use the in-domain Flair embeddings provided by the HIPE
organizers. "These embeddings were computed with a context of 250 characters, 1
hidden layer of size 2048, and a dropout of 0.1. Input was normalized with
lowercasing, replacement of digits by 0, everything else was kept as in the original
text" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Extracting character-level representations allows us to take advantage
of features related to the domain at hand. Following [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we use a CNN layer to
represent each word based on its characters. We initialize a lookup table
randomly with values between -0.5 and 0.5 to generate character representations
of 25 dimensions. The character set is formed by all characters present in the
dataset, plus the special PADDING and UNKNOWN tokens, used for CNN padding
and for all other characters respectively. Figure 3 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] presents an example where we give
the character embeddings of the word "Picasso" to a CNN.
        </p>
        <p>
          Additional features As information related to capitalisation is removed
during the construction of the word embeddings map, we use a separate lookup table
to add this feature with the following options: allCaps (the word is in capital
letters), upperInitial (only the first letter is capitalized), lowercase (the word
is lower cased) and mixedCaps (capital and small letters are mixed) [
          <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
          ]. In
our work, we also use additional character-based features: a
lookup table generates a vector representing the character's type (uppercase,
lowercase, punctuation or other).
        </p>
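        <p>A small sketch of these two lookup features as plain functions; the category
names follow the text, while the mapping of categories to embedding indices is
omitted.</p>
        <preformat>
def casing(word):
    # casing category of a word, one of the four options above
    if word.isupper():
        return "allCaps"
    if word[:1].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"

def char_type(ch):
    # type category of a single character
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    if not ch.isalnum():
        return "punctuation"
    return "other"
        </preformat>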
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Evaluations</title>
      <p>In this section we present our experiments and the results obtained with the
different models.</p>
      <sec id="sec-5-1">
        <title>Task Description</title>
        <p>The CLEF HIPE 2020 shared task includes two NE processing tasks with
subtasks of different levels of difficulty. In our work we participate in the
coarse-grained sub-task of the NERC task. This sub-task covers the recognition and
classification of entity mentions according to high-level entity types. In our case the
types used for annotation are: LOC, ORG, PERS, PROD and TIME.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Training data</title>
        <p>
          "The shared task corpus is composed of digitized and OCRed articles originating
from Swiss, Luxembourgish and American historical newspaper collections and
selected on a diachronic basis" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (from the Swiss National Library, the Luxembourgish National Library, and the
Library of Congress (Chronicling America project), respectively; the original collections
correspond to 4 Swiss and Luxembourgish titles, and a dozen for English). Table 1
shows an overview of the French corpus statistics.
        </p>
        <sec id="sec-5-2-1">
          <title>Datasets #docs #tokens #mentions %noisy</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>Train</title>
          <p>Dev</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>Test All</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Training and implementation details</title>
        <p>
          As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we use in our experiments the IOB (Inside, Outside, Begin) tagging
scheme. This scheme allows us to mark the position of each word within a
named entity. We implement our model using the Keras library with
TensorFlow as a backend. As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we initialize the LSTM states with zero
vectors. Except for the character and word embeddings, whose initializations have
been described previously, we initialize all lookup tables randomly. We train our
model with mini-batches using the Nadam optimization algorithm. As in [19] we use a
single layer for both the forward and backward LSTMs, and we apply dropout layers
so that our model learns from both word and character features. Furthermore,
applying dropout was effective in reducing overfitting and improving our model's
performance.
        </p>
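        <p>For illustration, an invented example (not taken from the corpus) of a sentence
labeled with this scheme: B- marks the first token of an entity, I- a continuation,
and O a token outside any entity.</p>
        <preformat>
tokens = ["Victor", "Hugo", "arrive", "à", "Paris", "."]
tags   = ["B-PERS", "I-PERS", "O",    "O", "B-LOC", "O"]
        </preformat>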
      </sec>
      <sec id="sec-5-4">
        <title>Evaluation Measures</title>
        <p>
          The NERC task in the CLEF HIPE 2020 shared task is evaluated in terms of
Precision, Recall and F-measure (F1). Evaluation is done at entity level
according to two metrics: micro average, with the consideration of all
TP, FP, and FN (true positives, false positives, false negatives) over all
documents, and macro average, with the average of the per-document micro
figures. NERC benefits from strict and fuzzy evaluation regimes: the strict
regime corresponds to exact boundary matching and the fuzzy one to overlapping
boundaries [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
For more details on the evaluation metrics we refer the reader to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
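        <p>A small sketch of the two aggregation modes described above, assuming
per-document (TP, FP, FN) counts are available; the toy values are illustrative.</p>
        <preformat>
def prf(tp, fp, fn):
    # precision, recall, F1 from raw counts
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro(docs):
    # pool TP/FP/FN over all documents, then compute P/R/F once
    tp, fp, fn = (sum(col) for col in zip(*docs))
    return prf(tp, fp, fn)

def macro(docs):
    # compute P/R/F per document, then average the figures
    scores = [prf(*d) for d in docs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

docs = [(8, 2, 1), (5, 0, 4)]   # (TP, FP, FN) per document (toy values)
print(micro(docs), macro(docs))
        </preformat>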
      </sec>
      <sec id="sec-5-5">
        <title>Models Evaluation</title>
        <p>In this section we evaluate the different models described above. Table 2
lists the main differences between them.</p>
        <sec id="sec-5-5-1">
          <title>Models</title>
        </sec>
        <sec id="sec-5-5-2">
          <title>Model 1</title>
        </sec>
        <sec id="sec-5-5-3">
          <title>Model 2</title>
        </sec>
        <sec id="sec-5-5-4">
          <title>Model 3</title>
        </sec>
        <sec id="sec-5-5-5">
          <title>Our model</title>
          <p>3
7
7
7</p>
        </sec>
        <sec id="sec-5-5-6">
          <title>Statistical Orth. features Neural Cont. WE Att. mech.</title>
          <p>3
7
7
7
7
3
3
3
7
7
7
3
7
7
3
3</p>
        <p>Results Tables 3, 4, 5 and 6 show a comparison of the results obtained with
the different models studied.</p>
        <p>Discussion According to the results presented in Tables 3 and 4, we notice on
the one hand that the use of in-domain contextualized word embeddings and the
attention mechanism leads to a higher F-measure for LOC, PROD and TIME
entities compared to all other models in both fuzzy and strict regimes. On the
other hand, the statistical model augmented with orthographic features performs
better on both ORG and PERS entities; this could be explained by the
importance of the syntactic information provided by these features and the large
portion of information that they encode, which are essential for the NERC task.</p>
        <p>
          Now if we consider the metonymic sense, according to Table 5, all neural models
perform better than the statistical model augmented with orthographic features
in both regimes, and our model has higher scores than the two other neural
models, except model 3, which has a higher F-measure in the strict regime. Moreover,
even if other models have higher precision, our model showed higher recall, which
leads to a higher F-measure. We are convinced by the observation that "actively
tackling the problem of OCR noise and hyphenation issues helps to achieve better
recall" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
These results show that neural models, especially our proposed model, where we
use contextualized word embeddings and the attention mechanism, perform far
better than the statistical model on all entities when it comes to the metonymic sense.
        </p>
        <p>Now if we consider Table 6, we notice that model 2 and model 3 perform
better than the statistical model and marginally better than our proposed model on
the ORG entity, which shows that these models were better able to generalize on
the test data at this stage.</p>
        <p>All these improvements demonstrate the efficiency of our neural model
architecture and of the different features used in training, especially the
contextualized word embeddings trained on large quantities of raw data and the
character embeddings extracted from the domain-specific dataset. Therefore, our
neural model is able to extract the necessary knowledge from the training data
without using handcrafted features.</p>
        <p>
          An important aspect of the CLEF HIPE 2020 shared task corpus, and of
historical newspaper data in general, is the noise generated by OCR. As reported
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], noisy mentions remarkably affect the model's performance: "as little noise
as 0.1 severely hurts the system's ability to predict an entity and may cut its
performance by half" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In our study, we do not report results obtained on
the dev set because, in the final step, after using the dev set to fine-tune our model's
parameters, we used the train and dev sets together for training. However, we would
like to confirm the degradation of our model's performance, caused in part by the
fact that "11% of all mentions in the test set contain OCR mistakes" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
          <p>[Table: our models' results for coarse NERC in French, considering the
metonymic sense of entities (micro average), with Precision, Recall and F-measure
per entity type; the values are not recoverable from the source.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented a hybrid approach for NERC applied to historical
newspapers. In our experiments, we used orthographic features related to word
syntax. Besides, we used word and character embeddings, which allow us to
detect morphological and orthographic features related to a specific domain. Our
experiments show an improvement in the overall performance. We notice that
our attentive neural model augmented with contextualized word embeddings
performs better overall compared to our baselines. To the best of our knowledge,
no previous study combines the application of the attention mechanism and
contextualized word embeddings in NERC for the historical newspapers domain.</p>
      <p>As future work, we aim to investigate the usefulness of adding further
features to the hybrid architecture and the use of external resources such as
ontologies and other knowledge and common-sense bases. Applying multi-task
learning will be part of our future work as well. Moreover, it would be relevant to
apply explainability techniques to the neural network models in order to better
explain and analyze the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Appelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Israel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kameyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Myers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tyson</surname>
          </string-name>
          .
          <article-title>SRI International FASTUS system: MUC-6 test results and analysis</article-title>
          .
          <source>In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6-8</source>
          ,
          <year>1995</year>
          , pages
          <fpage>237</fpage>
          –
          <lpage>248</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Simard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <article-title>Learning long-term dependencies with gradient descent is difficult</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>157</fpage>
          –
          <lpage>166</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          .
          <article-title>A large-scale comparison of historical text normalization systems</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Borin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Olsson</surname>
          </string-name>
          .
          <article-title>Naming the past: Named entity and Animacy recognition in 19th century Swedish literature</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH</source>
          <year>2007</year>
          ), pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          . Association for Computational Linguistics,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chiron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Visani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Moreux</surname>
          </string-name>
          .
          <article-title>Impact of ocr errors on the use of digital libraries: Towards a better access to information</article-title>
          .
          <source>In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , pages
          <fpage>249</fpage>
          –
          <lpage>252</lpage>
          . IEEE Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. P. C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Nichols</surname>
          </string-name>
          .
          <article-title>Named entity recognition with bidirectional lstm-cnns</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>357</fpage>
          –
          <lpage>370</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2493</fpage>
          –
          <lpage>2537</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>CoRR</source>
          , abs/1810.04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matthews</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Transition-based dependency parsing with stack long short-term memory</article-title>
          . volume
          <volume>1</volume>
          , pages
          <fpage>334</fpage>
          –
          <lpage>343</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fluckiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          .
          <article-title>Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers</article-title>
          . In L. Cappellato, C. Eickhoff, N. Ferro, and A. Neveol, editors,
          <source>CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fluckiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          .
          <article-title>Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers</article-title>
          . In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Neveol, L. Cappellato, and N. Ferro, editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), volume 12260 of Lecture Notes in Computer Science (LNCS)</source>
          . Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="r12">
        <mixed-citation>[12] A. Goyal, V. Gupta, and M. Kumar. Recent named entity recognition and classification techniques: A systematic review. Computer Science Review, 29(1):21–43, 2018.</mixed-citation>
      </ref>
      <ref id="r13">
        <mixed-citation>[13] D. Hanisch, K. Fundel, H.-T. Mevissen, R. Zimmer, and J. Fluck. ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6(1):S14, 2005.</mixed-citation>
      </ref>
      <ref id="r14">
        <mixed-citation>[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</mixed-citation>
      </ref>
      <ref id="r15">
        <mixed-citation>[15] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, and Y. Wilks. University of Sheffield: Description of the LaSIE-II system as used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998. Morgan, 1998.</mixed-citation>
      </ref>
      <ref id="r16">
        <mixed-citation>[16] M. Joshi, E. Hart, M. Vogel, and J.-D. Ruvini. Distributed word representations improve NER for e-commerce. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 160–167, Colorado, 2015. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="r17">
        <mixed-citation>[17] G. R. Krupka and K. Hausman. IsoQuest Inc.: Description of the NetOwl extractor system as used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998, pages 21–28, 1998.</mixed-citation>
      </ref>
      <ref id="r18">
        <mixed-citation>[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc., 2001.</mixed-citation>
      </ref>
      <ref id="r19">
        <mixed-citation>[19] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, pages 260–270, 2016.</mixed-citation>
      </ref>
      <ref id="r20">
        <mixed-citation>[20] J. Li, A. Sun, J. Han, and C. Li. A survey on deep learning for named entity recognition. CoRR, 2018.</mixed-citation>
      </ref>
      <ref id="r21">
        <mixed-citation>[21] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.</mixed-citation>
      </ref>
      <ref id="r22">
        <mixed-citation>[22] E. Linhares Pontes, A. Hamdi, N. Sidere, and A. Doucet. Impact of OCR quality on named entity linking. In A. Jatowt, A. Maeda, and S. Y. Syn, editors, Digital Libraries at the Crossroads of Digital Information for the Future, pages 102–115. Springer International Publishing, 2019.</mixed-citation>
      </ref>
      <ref id="r23">
        <mixed-citation>[23] Y. Luo, F. Xiao, and H. Zhao. Hierarchical contextualized representation for named entity recognition. CoRR, abs/1911.02257, 2019.</mixed-citation>
      </ref>
      <ref id="r24">
        <mixed-citation>[24] X. Ma and E. Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics, 2016.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          –
          <lpage>26</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Neudecker</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Antonacopoulos</surname>
          </string-name>
          .
          <article-title>Making europe's historical newspapers searchable</article-title>
          .
          <source>2016 12th IAPR Workshop on Document Analysis Systems (DAS)</source>
          , pages
          <fpage>405</fpage>
          –
          <lpage>410</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Quimbaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Munera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A. G.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. D.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M. M.</given-names>
            <surname>Velandia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. G.</given-names>
            <surname>Peña</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Labbe</surname>
          </string-name>
          .
          <article-title>Named entity recognition over electronic health records through a combined dictionary-based approach</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>100</volume>
          (
          <issue>1</issue>
          ):
          <fpage>55</fpage>
          –
          <lpage>61</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cordell</surname>
          </string-name>
          .
          <article-title>A Research Agenda for Historical and Multilingual Optical Character Recognition</article-title>
          .
          <source>Tech. rep</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>CoRR, abs/1706.03762</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          .
          <article-title>A survey on recent advances in named entity recognition from deep learning models</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>2145</fpage>
          –
          <lpage>2158</lpage>
          . Association for Computational Linguistics,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <article-title>Comparative study of CNN and RNN for natural language processing</article-title>
          .
          <source>CoRR, abs/1702.01923</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>