<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2020 Working Notes</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hybrid Statistical and Attentive Deep Neural Approach for Named Entity Recognition in Historical Newspapers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ghaith Dekhili</string-name>
          <email>dekhili.ghaith@courrier.uqam.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fatiha Sadat</string-name>
          <email>sadat.fatiha@uqam.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Quebec in Montreal</institution>
          ,
          <addr-line>201 President Kennedy avenue, H2X 3Y7 Montreal, Quebec</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <volume>4</volume>
      <issue>4</issue>
      <abstract>
        <p>Neural network based models have proved their efficiency on Named Entity Recognition (NER), one of the well-known NLP tasks. Besides, the attention mechanism has become an integral part of compelling sequence modeling and transduction models on various tasks. This technique allows the representation of context in a sequence by taking neighboring words into consideration. In this study, we propose an architecture that involves BiLSTM layers combined with a CRF layer and an attention layer in between. This was augmented with pre-trained contextualized word embeddings and dropout layers. Moreover, apart from using word representations, we use character-based representations, extracted by CNN layers, to capture morphological and orthographic information. Our experiments show an improvement in the overall performance. We notice that our attentive neural model augmented with contextualized word embeddings gives higher scores compared to our baselines. To the best of our knowledge, no previous study combines the application of the attention mechanism and contextualized word embeddings for NER on historical newspapers.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Contextualized Word Embeddings</kwd>
        <kwd>Character Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This work is done as part of the HIPE (Identifying Historical People, Places
and other Entities) shared task, "organised as a CLEF 2020 evaluation Lab and
dedicated to the evaluation of named entity processing on historical newspapers
in French, German and English" [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The shared task is organized as part of
"impresso - Media Monitoring of the Past" (https://impresso-project.ch/), a project
focused on information extraction in historical newspapers.
      </p>
      <p>Named Entity Recognition and Classification (NERC) is a sub-task of
information extraction and Natural Language Processing (NLP). It consists in
identifying certain textual objects such as names of persons, organizations and
places.</p>
      <p>
        Early NER systems were based on handcrafted rules, lexicons, orthographic
features and external knowledge resources. This was followed by "feature
engineering based NER systems" and machine learning [
        <xref ref-type="bibr" rid="ref12">25</xref>
        ]. Starting with [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], neural
network based systems with a minimum of feature engineering have become
popular. Such models are interesting because they typically do not need the
domain-specific resources of earlier systems, and are thus more
domain independent. Many neural architectures have been introduced, most of
them based on some form of Recurrent Neural Networks (RNN) over characters,
sub-words and/or word embeddings [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ].
      </p>
      <p>
        Knowledge-based NER systems do not need labeled data, as they rely on
lexicon resources and domain-specific knowledge. These systems perform well in
cases where the lexicon is exhaustive, but fail when the information does not
exist in the domain dictionaries [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ]. A second drawback of these systems is that
they require domain experts to construct and maintain the knowledge resources.
Finally, these systems can be used only on the domains and languages for which they
were designed, because of the specific features they had learned during training
[12].
      </p>
      <p>
        Supervised machine learning models learn how to make predictions during
training on pairs of inputs and their expected outputs, and can be used in
place of handcrafted rules [
        <xref ref-type="bibr" rid="ref17">30</xref>
        ].
      </p>
      <p>
        The NER task becomes more challenging when applied to historical and cultural
heritage collections. On the one hand, inputs can be extremely noisy, with errors
which differ from the ones in tweet misspellings or speech transcription
hesitations [
        <xref ref-type="bibr" rid="ref15 ref5">22, 5, 28</xref>
        ]. On the other hand, the language is mostly of an earlier
stage, "which renders usual external and internal evidences less effective (e.g.,
the usage of different naming conventions and presence of historical spelling
variations)" [
        <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
        ]. Finally, "archives and texts from the past are not as anglophone
as in today's information society, making multilingual resources and processing
capacities even more essential" [
        <xref ref-type="bibr" rid="ref11 ref13">26, 11</xref>
        ]. In this context, the objective of the CLEF
HIPE 2020 shared task is threefold:
strengthening the robustness of existing approaches on non-standard
inputs; enabling performance comparison of NE processing on historical
texts; and in the long run, fostering efficient semantic indexing of
historical documents in order to support scholarship on digital cultural
heritage collections [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The main NER approaches are based on computational linguistics and machine
learning. [13] proposed ProMiner, which is based on a dictionary of synonyms to
identify gene and protein mentions in text and link them to their corresponding
ids in the dictionary. [
        <xref ref-type="bibr" rid="ref14">27</xref>
        ] presented an approach based on dictionaries as well,
for NER in the medical domain. There are other well-known rule-based NER
systems such as LaSIE-II [15], NetOwl [17] and FASTUS [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These systems are
mainly based on semantic and syntactic rules to recognize entities [20].
      </p>
      <p>Among applied machine learning techniques, we quote Hidden Markov Models
(HMM), Maximum Entropy, decision trees, Support Vector Machines (SVM)
and Conditional Random Fields (CRF). [18] proposed a CRF model including
morphological features, Part-Of-Speech (POS) tags, and word sequences. [16]
used a CRF too, and showed that using Word2Vec pre-trained word embeddings
improves the performance of NER models.</p>
      <p>
        On the other hand, neural network based models have proved their efficiency
on NER tasks. Long Short-Term Memory (LSTM) [14] based neural networks
have been widely used in different NLP applications thanks to their ability to
detect long-term dependencies. These models showed good results compared to
traditional approaches, even though they do not need dictionaries, gazetteers or other
additional information. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presented a hybrid model combining a Bidirectional
LSTM (BiLSTM) and a Convolutional Neural Network (CNN). [19] introduced
a neural model similar to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], based on a BiLSTM combined with a CRF. [23] used
the attention mechanism to develop a model which takes advantage of sentence-level
and document-level hierarchical contextual representations. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced
BERT, acronym of Bidirectional Encoder Representations from Transformers,
"which is a language model designed for pre-training deep bidirectional
representations from unlabeled text by jointly conditioning on both left and right
context in all layers" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The model obtained new state-of-the-art results on
eleven natural language processing tasks. For more details on related work we
refer the reader to [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background on some Supervised ML Models</title>
      <p>In this section we present the supervised machine learning models used in this
research: the LSTM model, the BiLSTM model, which combines two LSTMs, and
a brief description of the CRF model and its usefulness.</p>
      <sec id="sec-3-1">
        <title>The Long Short-Term Memory model</title>
        <p>LSTM is an RNN architecture used in the field of deep learning. "This
powerful family of connectionist models can capture time dynamics via cycles in the
graph" [14, 24].</p>
        <p>
          RNNs take as input a sequence of vectors (x_1, x_2, ..., x_n) and return the
sequence of hidden state vectors (h_1, h_2, ..., h_n), which store the information
learned at the current and previous steps. "Although RNNs can, in theory, learn long
dependencies, in practice they fail to do so and tend to be biased towards their most
recent inputs in the sequence" [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Thanks to their memory cells, LSTMs are able
to resolve this point by capturing useful information from previous states [14]. An
LSTM unit is updated at a time t using the following equations [14, 24]:
        </p>
        <disp-formula>
          <tex-math><![CDATA[
\begin{aligned}
i_t &= \sigma(W_i h_{t-1} + U_i x_t + b_i) &\quad& (1)\\
f_t &= \sigma(W_f h_{t-1} + U_f x_t + b_f) && (2)\\
\tilde{c}_t &= \tanh(W_c h_{t-1} + U_c x_t + b_c) && (3)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && (4)\\
o_t &= \sigma(W_o h_{t-1} + U_o x_t + b_o) && (5)\\
h_t &= o_t \odot \tanh(c_t) && (6)
\end{aligned}
]]></tex-math>
        </disp-formula>
        <p>where:
σ is the element-wise sigmoid function;
⊙ is the element-wise product;
x_t is the input vector at time t;
h_t is the hidden state (or output) vector storing useful information at (and before) time t;
U_i, U_f, U_c, U_o are the weight matrices of the different gates for input x_t;
W_i, W_f, W_c, W_o denote the weight matrices for hidden state h_t;
b_i, b_f, b_c, b_o are the bias vectors [14, 24].</p>
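        <p>To make the update rule concrete, the following is a minimal NumPy sketch of
equations (1)-(6) for a single LSTM step; the dimensions and random parameters are
illustrative only and are not those used in our model.</p>
        <preformat>
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold the parameters of the four gates, keyed 'i', 'f', 'c', 'o'
    i_t = sigmoid(W['i'] @ h_prev + U['i'] @ x_t + b['i'])      # input gate, eq. (1)
    f_t = sigmoid(W['f'] @ h_prev + U['f'] @ x_t + b['f'])      # forget gate, eq. (2)
    c_tilde = np.tanh(W['c'] @ h_prev + U['c'] @ x_t + b['c'])  # candidate cell, eq. (3)
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state, eq. (4)
    o_t = sigmoid(W['o'] @ h_prev + U['o'] @ x_t + b['o'])      # output gate, eq. (5)
    h_t = o_t * np.tanh(c_t)                                    # new hidden state, eq. (6)
    return h_t, c_t

d_in, d_hid = 4, 3
rng = np.random.default_rng(0)
W = {g: rng.normal(size=(d_hid, d_hid)) for g in 'ifco'}
U = {g: rng.normal(size=(d_hid, d_in)) for g in 'ifco'}
b = {g: np.zeros(d_hid) for g in 'ifco'}
h, c = np.zeros(d_hid), np.zeros(d_hid)
for x in rng.normal(size=(5, d_in)):  # run over a 5-step input sequence
    h, c = lstm_step(x, h, c, W, U, b)
        </preformat>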
      </sec>
      <sec id="sec-3-2">
        <title>Bidirectional LSTM</title>
        <p>
          An LSTM computes a hidden state vector →h_t representing the left context of
the sentence at every step t [19]. To take advantage of the information that we could
get from treating the same sentence in reverse, [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] proposed the BiLSTM model.
The idea is to use a second LSTM to generate a second hidden state vector
←h_t representing the right context of the sentence. Concatenating these two
vectors yields a representation h_t = [→h_t ; ←h_t] of the word in its general context.
The resulting representation is useful for numerous tagging applications [19].
        </p>
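        <p>In Keras, which we use for our implementation (see Section 5.3), this
forward/backward concatenation can be obtained with the Bidirectional wrapper;
the sketch below uses illustrative sizes, not those of our model.</p>
        <preformat>
from keras.layers import Input, Bidirectional, LSTM
from keras.models import Model

# Each output step is the concatenation [->h_t ; <-h_t] of the forward
# and backward hidden states.
seq = Input(shape=(None, 100))     # sequence of 100-dim word vectors
h = Bidirectional(LSTM(64, return_sequences=True), merge_mode='concat')(seq)
model = Model(seq, h)              # output shape: (batch, time, 128)
        </preformat>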
      </sec>
      <sec id="sec-3-2b">
        <title>CRF</title>
        <p>"A very simple but surprisingly effective tagging model is to use the h_t's as
features to make independent tagging decisions for each output y_t" [21]. In sequence
labeling tasks, taking neighboring labels into consideration can help when
analyzing a given input sentence: just as, in some "grammar" rules, a noun is more
likely to follow an adjective than a verb, in NER the tag I-ORG cannot
follow I-PERS [24].</p>
        <p>Therefore, as in the research presented by [19], we model the label
sequence jointly using a CRF, instead of modeling labels independently.</p>
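        <p>As a sketch of the idea (not of the HIPE implementation), a linear-chain CRF
scores a candidate tag sequence by adding the per-token emission scores coming from
the network to learned transition scores between adjacent tags:</p>
        <preformat>
import numpy as np

def sequence_score(emissions, transitions, tags):
    # emissions:   (T, n_tags) per-token scores from the BiLSTM
    # transitions: (n_tags, n_tags) learned tag-to-tag scores
    # tags:        list of T tag indices; returns the unnormalized score
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score

rng = np.random.default_rng(0)
s = sequence_score(rng.normal(size=(4, 5)), rng.normal(size=(5, 5)), [0, 2, 2, 1])
        </preformat>
        <p>Training maximizes the probability of the gold sequence, i.e. this score
normalized over all possible tag sequences; a constraint such as "I-ORG cannot
follow I-PERS" then corresponds to a very low learned transition score.</p>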
      </sec>
      <sec id="sec-3-3">
        <title>Extracting Character Features Using a CNN</title>
        <p>
          CNN layers have become ubiquitous in many NLP tasks. As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], for
each word we apply a convolution and a max layer to the per-character features
(optionally including the character type) to obtain at the end a new character-level
embedding with the resulting features. A special token "PADDING" is used on both
sides of words to keep sequences the same length. In this section we present
a brief description of CNNs applied to text [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        </p>
        <p>
          Input Layer An input sequence x with n elements, each one represented by a
d-dimensional vector, can be represented as a map of features of dimensionality
d × n. The bottom of Figure 1 shows the input layer as a rectangle with multiple
columns [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        <p>
          Convolution Layer The convolution layer learns representations by
sliding w-grams over the input sequence (x_1, x_2, ..., x_n). We consider a vector
c_i ∈ R^{wd} as the concatenated embeddings of w entries (x_{i-w+1}, ..., x_i), where w is
the filter width and 0 &lt; i &lt; s + w. We pad the embeddings of x_i, where i &lt; 1
or i &gt; n, with zeros. We then represent the w-gram (x_{i-w+1}, ..., x_i) by a new
vector p_i ∈ R^d using the convolution weights W ∈ R^{d×wd}:
        </p>
        <disp-formula>
          <tex-math><![CDATA[p_i = \tanh(W c_i + b) \qquad (7)]]></tex-math>
        </disp-formula>
        <p>
          where b ∈ R^d is the bias [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
        <p>
          Maxpooling We use the w-gram representations p_i (i = 1, ..., s + w - 1) to generate the
representation of the input sequence x by applying max-pooling: x_j = max(p_{1,j}, p_{2,j}, ...),
where j = 1, ..., d [
          <xref ref-type="bibr" rid="ref18">31</xref>
          ].
        </p>
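        <p>A minimal Keras sketch of this character-level pipeline, using the 25-dimensional
character embeddings mentioned in Section 4.2; MAX_WORD_LEN, N_CHARS and the filter
settings are illustrative assumptions.</p>
        <preformat>
from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D
from keras.models import Model

MAX_WORD_LEN, N_CHARS = 20, 100           # assumed padded word length, charset size
chars = Input(shape=(MAX_WORD_LEN,))
x = Embedding(N_CHARS, 25)(chars)         # 25-dim character embeddings
x = Conv1D(filters=30, kernel_size=3, padding='same', activation='tanh')(x)
word_vec = GlobalMaxPooling1D()(x)        # max over character positions
char_encoder = Model(chars, word_vec)     # one vector per word
        </preformat>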
      </sec>
      <sec id="sec-3-4">
        <title>Attention mechanism</title>
        <p>
          "The attention mechanism has become an integral part of compelling sequence
modeling and transduction models in various tasks" [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ]. This technique allows
the representation of the context in a sequence by taking neighboring words into
consideration [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ]. For more details on the attention mechanism we refer the reader to [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ].
        </p>
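        <p>As an illustration only, the following is a minimal NumPy sketch of the scaled
dot-product self-attention of [29], where each position is re-represented as a
weighted sum of all positions in the sequence.</p>
        <preformat>
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h):
    # h: (T, d) hidden states; weights reflect pairwise similarity of positions
    scores = h @ h.T / np.sqrt(h.shape[-1])
    return softmax(scores, axis=-1) @ h

h = np.random.default_rng(0).normal(size=(6, 8))   # 6 positions, 8 dims
context = self_attention(h)                        # shape (6, 8)
        </preformat>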
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Our Proposed Approach</title>
      <p>This study aims to evaluate the effectiveness of our proposed attentive neural
approach in recognizing and classifying NEs in historical newspapers, in
comparison with a statistical model augmented with orthographic features and
two other neural models.</p>
      <sec id="sec-4-1">
        <title>Statistical approach</title>
        <p>
          Statistical approaches based on CRF, SVM or Perceptron have achieved good
performance using only handcrafted features in many NLP tasks such as NERC
[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. In our work, we use the CRFsuite (http://www.chokkan.org/software/crfsuite/)
implementation of CRF provided by the HIPE team. Among CRF implementations,
CRFsuite is the fastest for training the model and labeling data.
        </p>
        <p>In our baseline we use basic orthographic spelling features extracted from
words, such as the prefix and suffix, the casing of the initial character, and whether
the token is a digit.</p>
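        <p>A minimal sketch of such orthographic features in the format accepted by the
python-crfsuite binding (one feature dictionary per token); the exact feature
templates of the provided baseline may differ.</p>
        <preformat>
import pycrfsuite

def word2features(sent, i):
    w = sent[i]
    return {
        "prefix3": w[:3],               # prefix
        "suffix3": w[-3:],              # suffix
        "init_upper": w[:1].isupper(),  # casing of the initial character
        "all_upper": w.isupper(),
        "is_digit": w.isdigit(),        # whether the token is a digit
    }

sent = ["Jean", "visite", "Genève", "en", "1898", "."]
xseq = [word2features(sent, i) for i in range(len(sent))]

trainer = pycrfsuite.Trainer(verbose=False)
# trainer.append(xseq, yseq)          # yseq: gold IOB tags for the sentence
# trainer.train("baseline.crfsuite")  # writes the model file
        </preformat>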
      </sec>
      <sec id="sec-4-2">
        <title>Neural network approach</title>
        <p>In this section we present our neural NER model, followed by a brief description
of the input embeddings and additional features used.</p>
        <p>
          Proposed NER Model As in [
          <xref ref-type="bibr" rid="ref6">6, 19</xref>
          ], we use in our architecture BiLSTM layers
for the extraction of word-level features. These layers are followed by an attention
layer. We also use a CRF layer on top of our model, augmented with
features such as dropout layers. Figure 2 presents our proposed architecture.
Apart from using word representations, we also use character representations to
extract morphological and orthographic features. As shown in Figure 2, word
embeddings are given to a BiLSTM. l_i and r_i represent the word i in its left and
right contexts respectively. The concatenation of these two vectors represents the
word's context c_i [19].
        </p>
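        <p>The following is a simplified Keras sketch of this architecture, showing only the
word-embedding path (the character CNN and casing inputs described below are
omitted for brevity). The CRF layer is assumed to come from the keras-contrib
package (an assumption; the paper does not name a CRF implementation), the
attention layer is written as a plain dot-product self-attention, and MAX_LEN,
VOCAB_SIZE, EMB_DIM, N_TAGS and the pretrained matrix are placeholders.</p>
        <preformat>
import numpy as np
from keras import backend as K
from keras.layers import (Input, Embedding, Bidirectional, LSTM, Dropout,
                          TimeDistributed, Dense, Lambda)
from keras.models import Model
from keras_contrib.layers import CRF   # assumption: keras-contrib CRF layer

MAX_LEN, VOCAB_SIZE, EMB_DIM, N_TAGS = 100, 20000, 300, 11
pretrained = np.zeros((VOCAB_SIZE, EMB_DIM))   # placeholder embedding matrix

def attend(h):
    # dot-product self-attention over the BiLSTM states
    scores = K.batch_dot(h, h, axes=[2, 2]) / K.sqrt(K.cast(K.shape(h)[-1], 'float32'))
    return K.batch_dot(K.softmax(scores), h)

words = Input(shape=(MAX_LEN,))
x = Embedding(VOCAB_SIZE, EMB_DIM, weights=[pretrained])(words)
x = Dropout(0.5)(x)
x = Bidirectional(LSTM(100, return_sequences=True))(x)   # c_i = [l_i ; r_i]
x = Lambda(attend)(x)                                    # attention layer
x = Dropout(0.5)(x)
x = TimeDistributed(Dense(50, activation='relu'))(x)
crf = CRF(N_TAGS)
out = crf(x)
model = Model(words, out)
model.compile(optimizer='nadam', loss=crf.loss_function, metrics=[crf.accuracy])
        </preformat>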
        <p>
          Input Embeddings The input layers of our model are vector representations
of words. "Learning independent representations for word types from the limited
NER training data is a difficult problem: there are simply too many parameters
to reliably estimate" [19]. In our study, we use pre-trained contextualized word
embeddings to initialize our lookup table and to enrich our training dataset.
In our experiments, we use the in-domain Flair embeddings provided by the HIPE
organizers. "These embeddings were computed with a context of 250 characters, 1
hidden layer of size 2048, and a dropout of 0.1. Input was normalized with
lowercasing, replacement of digits by 0, everything else was kept as in the original
text" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Extracting character-level representations allows us to take advantage
of features related to the domain at hand. Following [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we use a CNN layer to
represent each word based on its characters. We initialize a lookup table
randomly with values between -0.5 and 0.5 to generate character representations
of 25 dimensions. The character set is formed by all characters present in the
dataset, plus the special PADDING and UNKNOWN tokens, used for CNN padding
and for all other characters respectively. Figure 3 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] presents an example where we give
the character embeddings of the word "Picasso" to a CNN.
        </p>
        <p>
          Additional features As information related to capitalisation is removed
during the construction of the word embeddings map, we use a separate lookup table
to add this feature with the following options: allCaps (the word is in capital
letters), upperInitial (only the first letter is capitalized), lowercase (the word
is lower cased) and mixedCaps (capital and small letters are mixed) [
          <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
          ]. In
our work, we also use additional character-based features: a
lookup table generates a vector representing the character's type (uppercase,
lowercase, punctuation or other).
        </p>
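        <p>A small sketch of these two lookup features as plain functions; the category
names follow the text, while the mapping of categories to embedding indices is
omitted.</p>
        <preformat>
def casing(word):
    # casing category of a word, one of the four options above
    if word.isupper():
        return "allCaps"
    if word[:1].isupper() and word[1:].islower():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"

def char_type(ch):
    # type category of a single character
    if ch.isupper():
        return "uppercase"
    if ch.islower():
        return "lowercase"
    if not ch.isalnum():
        return "punctuation"
    return "other"
        </preformat>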
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments and Evaluations</title>
      <p>In this section we present our experiments and the results obtained with the
different models.</p>
      <sec id="sec-5-1">
        <title>Task Description</title>
        <p>The CLEF HIPE 2020 shared task includes two NE processing tasks with
subtasks of different levels of difficulty. In our work we participate in the
coarse-grained sub-task of the NERC task. This sub-task covers the recognition and
classification of entity mentions according to high-level entity types. In our case the
types used for annotation are: LOC, ORG, PERS, PROD and TIME.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Training data</title>
        <p>
          "The shared task corpus is composed of digitized and OCRed articles originating
from Swiss, Luxembourgish and American historical newspaper collections and
selected on a diachronic basis" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] (from the Swiss National Library, the Luxembourgish National Library, and the
Library of Congress (Chronicling America project), respectively; the original collections
correspond to 4 Swiss and Luxembourgish titles, and a dozen for English). Table 1
shows an overview of the French corpus statistics.
        </p>
        <sec id="sec-5-2-1">
          <title>Datasets #docs #tokens #mentions %noisy</title>
        </sec>
        <sec id="sec-5-2-2">
          <title>Train</title>
          <p>Dev</p>
        </sec>
        <sec id="sec-5-2-3">
          <title>Test All</title>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Training and implementation details</title>
        <p>
          As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we use in our experiments the IOB (Inside, Outside, Begin) tagging
scheme. This scheme allows us to mark the position of each word within a
named entity. We implement our model using the Keras library with
TensorFlow as a backend. As in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], we initialize the LSTM states with zero
vectors. Except for the character and word embeddings, whose initializations have
been described previously, we initialize all lookup tables randomly. We train our
model with mini-batches using the Nadam optimization algorithm. As in [19] we use a
single layer for both the forward and backward LSTMs, and we apply dropout layers
so that our model learns from both word and character features. Furthermore,
applying dropout was effective in reducing overfitting and improving our model's
performance.
        </p>
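        <p>For illustration, an invented example (not taken from the corpus) of a sentence
labeled with this scheme: B- marks the first token of an entity, I- a continuation,
and O a token outside any entity.</p>
        <preformat>
tokens = ["Victor", "Hugo", "arrive", "à", "Paris", "."]
tags   = ["B-PERS", "I-PERS", "O",    "O", "B-LOC", "O"]
        </preformat>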
      </sec>
      <sec id="sec-5-4">
        <title>Evaluation Measures</title>
        <p>
          The NERC task in the CLEF HIPE 2020 shared task is evaluated in terms of
Precision, Recall and F-measure (F1). Evaluation is done at entity level
according to two metrics: micro average, with the consideration of all
TP, FP, and FN (true positives, false positives, false negatives) over all
documents, and macro average, with the average of the per-document micro
figures. NERC benefits from strict and fuzzy evaluation regimes: the strict
regime corresponds to exact boundary matching and the fuzzy one to overlapping
boundaries [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
For more details on the evaluation metrics we refer the reader to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
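        <p>A small sketch of the two aggregation modes described above, assuming
per-document (TP, FP, FN) counts are available; the toy values are illustrative.</p>
        <preformat>
def prf(tp, fp, fn):
    # precision, recall, F1 from raw counts
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro(docs):
    # pool TP/FP/FN over all documents, then compute P/R/F once
    tp, fp, fn = (sum(col) for col in zip(*docs))
    return prf(tp, fp, fn)

def macro(docs):
    # compute P/R/F per document, then average the figures
    scores = [prf(*d) for d in docs]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))

docs = [(8, 2, 1), (5, 0, 4)]   # (TP, FP, FN) per document (toy values)
print(micro(docs), macro(docs))
        </preformat>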
      </sec>
      <sec id="sec-5-5">
        <title>Models Evaluation</title>
        <p>In this section we evaluate the different models described above. Table 2
lists the main differences between them.</p>
        <sec id="sec-5-5-1">
          <title>Models</title>
        </sec>
        <sec id="sec-5-5-2">
          <title>Model 1</title>
        </sec>
        <sec id="sec-5-5-3">
          <title>Model 2</title>
        </sec>
        <sec id="sec-5-5-4">
          <title>Model 3</title>
        </sec>
        <sec id="sec-5-5-5">
          <title>Our model</title>
          <p>3
7
7
7</p>
        </sec>
        <sec id="sec-5-5-6">
          <title>Statistical Orth. features Neural Cont. WE Att. mech.</title>
          <p>3
7
7
7
7
3
3
3
7
7
7
3
7
7
3
3</p>
        <p>Results Tables 3, 4, 5 and 6 show a comparison of the results obtained with
the different models studied.</p>
        <p>Discussion According to the results presented in Tables 3 and 4, we notice on
the one hand that the use of in-domain contextualized word embeddings and the
attention mechanism leads to a higher F-measure for LOC, PROD and TIME
entities compared to all other models in both fuzzy and strict regimes. On the
other hand, the statistical model augmented with orthographic features performs
better on both ORG and PERS entities; this could be explained by the
importance of the syntactic information provided by these features and the large
portion of information that they encode, which are essential for the NERC task.</p>
        <p>
          Now if we consider the metonymic sense, according to Table 5, all neural models
perform better than the statistical model augmented with orthographic features
in both regimes, and our model has higher scores than the two other neural
models, except model 3, which has a higher F-measure in the strict regime. Moreover,
even if other models have higher precision, our model showed higher recall, which
leads to a higher F-measure. We are convinced by the observation that "actively
tackling the problem of OCR noise and hyphenation issues helps to achieve better
recall" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
These results show that neural models, especially our proposed model, where we
use contextualized word embeddings and the attention mechanism, perform far
better than the statistical model on all entities when it comes to the metonymic sense.
        </p>
        <p>Now if we consider Table 6, we notice that model 2 and model 3 perform
better than the statistical model and marginally better than our proposed model on
the ORG entity, which shows that these models were better able to generalize on
the test data at this stage.</p>
        <p>All these improvements demonstrate the efficiency of our neural model
architecture and of the different features used in training, especially the
contextualized word embeddings trained on large quantities of raw data and the
character embeddings extracted from the domain-specific dataset. Therefore, our
neural model is able to extract the necessary knowledge from the training data
without using handcrafted features.</p>
        <p>
          An important aspect of the CLEF HIPE 2020 shared task corpus, and of
historical newspaper data in general, is the noise generated by OCR. As reported
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], noisy mentions remarkably affect the model's performance: "as little noise
as 0.1 severely hurts the system's ability to predict an entity and may cut its
performance by half" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In our study, we do not report results obtained on
the dev set because, in the final step, after using the dev set to fine-tune our model's
parameters, we used the train and dev sets together for training. However, we would
like to confirm the degradation of our model's performance, caused in part by the
fact that "11% of all mentions in the test set contain OCR mistakes" [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
          <p>[Table: our models' results for coarse NERC in French, considering the
metonymic sense of entities (micro average), with Precision, Recall and F-measure
per entity type; the values are not recoverable from the source.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we presented a hybrid approach for NERC applied to historical
newspapers. In our experiments, we used orthographic features related to word
syntax. Besides, we used word and character embeddings, which allow us to
detect morphological and orthographic features related to a specific domain. Our
experiments show an improvement in the overall performance. We notice that
our attentive neural model augmented with contextualized word embeddings
performs better overall compared to our baselines. To the best of our knowledge,
no previous study combines the application of the attention mechanism and
contextualized word embeddings in NERC for the historical newspapers domain.</p>
      <p>As future work, we aim to investigate the usefulness of adding further
features to the hybrid architecture and the use of external resources such as
ontologies and other knowledge and common-sense bases. Applying multi-task
learning will be part of our future work as well. Moreover, it would be relevant to
apply explainability techniques to the neural network models in order to better
explain and analyze the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Appelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Hobbs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Israel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kameyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kehler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Myers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tyson</surname>
          </string-name>
          .
          <article-title>SRI International FASTUS system: MUC-6 test results and analysis</article-title>
          .
          <source>In Sixth Message Understanding Conference (MUC-6): Proceedings of a Conference Held in Columbia, Maryland, November 6-8</source>
          ,
          <year>1995</year>
          , pages
          <fpage>237</fpage>
          –
          <lpage>248</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Simard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Frasconi</surname>
          </string-name>
          .
          <article-title>Learning long-term dependencies with gradient descent is difficult</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>
          ,
          <volume>5</volume>
          (
          <issue>2</issue>
          ):
          <fpage>157</fpage>
          –
          <lpage>166</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bollmann</surname>
          </string-name>
          .
          <article-title>A large-scale comparison of historical text normalization systems</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers).
          <source>Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Borin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Olsson</surname>
          </string-name>
          .
          <article-title>Naming the past: Named entity and Animacy recognition in 19th century Swedish literature</article-title>
          .
          <source>In Proceedings of the Workshop on Language Technology for Cultural Heritage Data (LaTeCH</source>
          <year>2007</year>
          ), pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          . Association for Computational Linguistics,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chiron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doucet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Coustaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Visani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Moreux</surname>
          </string-name>
          .
          <article-title>Impact of ocr errors on the use of digital libraries: Towards a better access to information</article-title>
          .
          <source>In 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)</source>
          , pages
          <fpage>249</fpage>
          –
          <lpage>252</lpage>
          . IEEE Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. P. C.</given-names>
            <surname>Chiu</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Nichols</surname>
          </string-name>
          .
          <article-title>Named entity recognition with bidirectional lstm-cnns</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>357</fpage>
          –
          <lpage>370</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2493</fpage>
          –
          <lpage>2537</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>CoRR</source>
          , abs/1810.04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ballesteros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matthews</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>Transition-based dependency parsing with stack long short-term memory</article-title>
          . volume
          <volume>1</volume>
          , pages
          <fpage>334</fpage>
          –
          <lpage>343</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fluckiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          .
          <article-title>Extended Overview of CLEF HIPE 2020: Named Entity Processing on Historical Newspapers</article-title>
          . In L. Cappellato, C. Eickhoff, N. Ferro, and A. Neveol, editors,
          <source>CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum. CEUR-WS</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ehrmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Romanello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fluckiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Clematide</surname>
          </string-name>
          .
          <article-title>Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers</article-title>
          . In A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhoff, A. Neveol, L. Cappellato, and N. Ferro, editors,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020), volume 12260 of Lecture Notes in Computer Science (LNCS)</source>
          . Springer,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="r12">
        <mixed-citation>[12] A. Goyal, V. Gupta, and M. Kumar. Recent named entity recognition and classification techniques: A systematic review. Computer Science Review, 29(1):21–43, 2018.</mixed-citation>
      </ref>
      <ref id="r13">
        <mixed-citation>[13] D. Hanisch, K. Fundel, H.-T. Mevissen, R. Zimmer, and J. Fluck. ProMiner: rule-based protein and gene entity recognition. BMC Bioinformatics, 6(1):S14, 2005.</mixed-citation>
      </ref>
      <ref id="r14">
        <mixed-citation>[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.</mixed-citation>
      </ref>
      <ref id="r15">
        <mixed-citation>[15] K. Humphreys, R. Gaizauskas, S. Azzam, C. Huyck, B. Mitchell, H. Cunningham, and Y. Wilks. University of Sheffield: Description of the LaSIE-II system as used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998. Morgan, 1998.</mixed-citation>
      </ref>
      <ref id="r16">
        <mixed-citation>[16] M. Joshi, E. Hart, M. Vogel, and J.-D. Ruvini. Distributed word representations improve NER for e-commerce. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 160–167, Colorado, 2015. Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="r17">
        <mixed-citation>[17] G. R. Krupka and K. Hausman. IsoQuest Inc.: Description of the NetOwl extractor system as used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 - May 1, 1998, pages 21–28, 1998.</mixed-citation>
      </ref>
      <ref id="r18">
        <mixed-citation>[18] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289. Morgan Kaufmann Publishers Inc., 2001.</mixed-citation>
      </ref>
      <ref id="r19">
        <mixed-citation>[19] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer. Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016, pages 260–270, 2016.</mixed-citation>
      </ref>
      <ref id="r20">
        <mixed-citation>[20] J. Li, A. Sun, J. Han, and C. Li. A survey on deep learning for named entity recognition. CoRR, 2018.</mixed-citation>
      </ref>
      <ref id="r21">
        <mixed-citation>[21] W. Ling, T. Luís, L. Marujo, R. F. Astudillo, S. Amir, C. Dyer, A. W. Black, and I. Trancoso. Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.</mixed-citation>
      </ref>
      <ref id="r22">
        <mixed-citation>[22] E. Linhares Pontes, A. Hamdi, N. Sidere, and A. Doucet. Impact of OCR quality on named entity linking. In A. Jatowt, A. Maeda, and S. Y. Syn, editors, Digital Libraries at the Crossroads of Digital Information for the Future, pages 102–115. Springer International Publishing, 2019.</mixed-citation>
      </ref>
      <ref id="r23">
        <mixed-citation>[23] Y. Luo, F. Xiao, and H. Zhao. Hierarchical contextualized representation for named entity recognition. CoRR, abs/1911.02257, 2019.</mixed-citation>
      </ref>
      <ref id="r24">
        <mixed-citation>[24] X. Ma and E. Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics, 2016.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nadeau</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sekine</surname>
          </string-name>
          .
          <article-title>A survey of named entity recognition and classification</article-title>
          .
          <source>Lingvisticae Investigationes</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          –
          <lpage>26</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Neudecker</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Antonacopoulos</surname>
          </string-name>
          .
          <article-title>Making europe's historical newspapers searchable</article-title>
          .
          <source>2016 12th IAPR Workshop on Document Analysis Systems (DAS)</source>
          , pages
          <fpage>405</fpage>
          –
          <lpage>410</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Quimbaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Munera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A. G.</given-names>
            <surname>Rivera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. D.</given-names>
            <surname>Rodríguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. M. M.</given-names>
            <surname>Velandia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. G.</given-names>
            <surname>Peña</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Labbe</surname>
          </string-name>
          .
          <article-title>Named entity recognition over electronic health records through a combined dictionary-based approach</article-title>
          .
          <source>Procedia Computer Science</source>
          ,
          <volume>100</volume>
          (
          <issue>1</issue>
          ):
          <fpage>55</fpage>
          –
          <lpage>61</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Cordell</surname>
          </string-name>
          .
          <article-title>A Research Agenda for Historical and Multilingual Optical Character Recognition</article-title>
          .
          <source>Tech. rep</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>CoRR, abs/1706.03762</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>V.</given-names>
            <surname>Yadav</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          .
          <article-title>A survey on recent advances in named entity recognition from deep learning models</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Computational Linguistics</source>
          , pages
          <fpage>2145</fpage>
          –
          <lpage>2158</lpage>
          . Association for Computational Linguistics,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Schütze</surname>
          </string-name>
          .
          <article-title>Comparative study of CNN and RNN for natural language processing</article-title>
          .
          <source>CoRR, abs/1702.01923</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>