Towards Entity Linking, NER in Archival Finding Aids*

Luís Filipe da Costa Cunha1 and José Carlos Ramalho2[0000-0002-8574-1574]

1 University of Minho, Portugal
a83099@alunos.uminho.pt
2 Department of Informatics, University of Minho, Portugal
jcr@di.uminho.pt

* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The amount of information held in Portuguese archives has been increasing exponentially over the years. At the moment, most of the data is already available to the public in digital format; however, the records are stored as unstructured text, which makes processing them challenging. We therefore aim to perform a semantic interpretation of these documents through the identification and classification of Named Entities. For this purpose, the use of Natural Language Processing tools is proposed, training Machine Learning algorithms capable of accurately recognizing entities in this context. Finally, we present a Web platform that implements all the models trained in this paper, as well as some tools that supported the entity extraction process.

Keywords: Archival Descriptions · Named Entity Recognition · Machine Learning · Web

1 Introduction

At the moment, in Portugal, there are hundreds of archives spread across the country that keep a diverse universe of archival patrimony in custody. Of these, it is interesting to highlight three: the Arquivo Nacional da Torre do Tombo, the Arquivo Distrital da cidade de Coimbra and the Arquivo Distrital da cidade de Braga. These are considered historical archives since they preserve records of important events that took place throughout the history of the country.

Nowadays, most of the records stored in these archives are already available to the public and can be consulted online via Web portals such as Digitarq [1] or Archeevo [10]. Despite this, the data provided does not have any kind of annotation and is served as natural text, which makes processing and analyzing it difficult. Thus, we intend to perform entity recognition on these documents using Machine Learning (ML) tools, a technique that has been showing excellent results in Natural Language Processing.

In fact, there are already several ML models optimized to extract entities from Portuguese documents; however, the models found were trained in different contexts, which means that when applied to archival documents they produce results below the intended quality. Thus, in order to improve entity extraction accuracy, new ML models were trained.

Finally, after implementing the entity recognition mechanism, a Web platform was developed and deployed in order to make the generated tools available to the public.

2 Related Work

The study of archival records is something that has been done over the years, and the use of the available computational power is not new to professional historians. In fact, several tools have been developed over time that assist in processing archival data. An example of this is the HITEX [13] project, developed by the Arquivo Distrital de Braga between 1989 and 1991. This project consisted of developing a semantic model for the archive's historical data, something quite ambitious for that time.
Despite this, during its development it ended up converging to an archival transcription support system, which allowed the transcription of natural text and the manual annotation of Named Entities, enabling the creation of chronological, toponymic and anthroponymic indexes.

Another problem associated with this type of document is the lack of standardisation of its structure. This made it difficult to share information within the archival community, both nationally and internationally. To promote data interoperability, guidelines for archival description have been created in Portugal that define rules for standardising archival descriptions [16]. The purpose of these standards is to create a working tool to be used by the Portuguese archivist community when describing documentation and its producing entity, thus promoting organisation and consistency and ensuring that the created descriptions comply with the international standards of the associated domain. The adoption of these guidelines makes it possible to simplify research and information exchange, whether at the national or international level.

3 NER Tools

In order to extract entities from archival documents, a subfield of Natural Language Processing was used: Named Entity Recognition. This subject focuses on identifying and classifying Named Entities in text documents, in this case archival finding aids.

To recognize entities in natural text, one can resort to several mechanisms, such as the simple use of regular expressions, although some approaches are considered more flexible than others [8]. In fact, nowadays this activity is usually associated with the use of ML tools, which have been showing increasingly accurate results. Initially, Portuguese pre-trained models trained on the HAREM [5] and SIGARRA [15] datasets were used; however, because these datasets contain data of a different nature from the context of this paper, the results obtained were below the intended quality, so new training data was generated in order to train new models from scratch.

In this paper, three distinct NER implementations using ML statistical models are presented, based on different kinds of Neural Networks and on the Maximum Entropy algorithm.

3.1 OpenNLP - Maximum Entropy

The first tool presented to perform NER is Apache OpenNLP [14], a machine learning-based toolkit, developed in the Java programming language, that provides a wide range of ML features for NLP, entity recognition being one of them. To recognize entities in unstructured text, this tool uses a Maximum Entropy statistical model which, in short, consists of maximizing the entropy of a probability distribution subject to an N number of constraints [11].

The core of this algorithm is to define a set of features that allow the introduction of known information about the problem domain. Then, the Information Entropy function is used to maximize the entropy of the models that satisfy the restrictions imposed by the previously selected features, in order to choose the model that makes the smallest implicit assumptions possible.
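OpenNLP itself is a Java toolkit whose name finder is trained from annotated sentences; the fragment below is only a rough Python illustration of the feature-based maximum entropy idea, using scikit-learn's multinomial logistic regression (equivalent to a maximum entropy classifier) over a hand-picked, hypothetical feature set and toy data, not OpenNLP's actual configuration.

```python
# Illustrative only: a maximum entropy (multinomial logistic regression) token
# classifier over hand-crafted features. The feature set and training sentences
# below are hypothetical, not the ones used in this work.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(tokens, i):
    """Features describing token i in its sentence context."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_title": tok.istitle(),   # capitalised words often start names
        "is_digit": tok.isdigit(),   # useful for dates
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }

# Tiny toy training set with BIO-style labels (hypothetical example).
sentences = [
    (["Manuel", "de", "Araújo", "nasceu", "em", "Braga"],
     ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC"]),
]

X, y = [], []
for tokens, labels in sentences:
    for i, label in enumerate(labels):
        X.append(token_features(tokens, i))
        y.append(label)

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000)  # multinomial LR == maxent classifier
clf.fit(vec.fit_transform(X), y)

test = ["José", "vive", "em", "Coimbra"]
feats = [token_features(test, i) for i in range(len(test))]
print(list(zip(test, clf.predict(vec.transform(feats)))))
```

The point of the sketch is the workflow: known information about the domain enters only through the feature functions, and the classifier then chooses the least committed distribution compatible with them.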
3.2 spaCy - Convolutional Neural Network

Another tool used to experiment with the entity recognition potential in this domain was spaCy [17], an open-source library for advanced natural language processing, developed in Python. Again, this tool approaches NER with ML algorithms, this time using Deep Learning, namely Convolutional Neural Networks (CNN). In fact, the use of Deep Learning in this area is increasingly common due to the results this approach has shown. In this case, spaCy uses a transition-based approach [18], i.e., a system that has a set of actions at its disposal, for example, associating an entity label with a certain token or not. The challenge of this approach is thus to determine which action to take. For this, a "Deep Learning framework" is implemented, which helps the system predict the action to be taken, in favor of the correct identification and classification of Named Entities.

3.3 TensorFlow - BI-LSTM-CRF

The last tool used in this paper was TensorFlow, an open-source library focused on ML features that allow models to be developed and trained in a way loosely inspired by how the human mind learns. Using this tool, the goal was to create a system capable of recognizing entities in Portuguese archival texts. For this, it was necessary to implement tokenizer mechanisms, create a vocabulary and generate word embeddings from this vocabulary, in order to create and train the NER statistical model.

Usually, TensorFlow is associated with the use of Deep Learning, and there is a kind of Neural Network that is particularly good at processing sequential data, the Recurrent Neural Network (RNN) [6], which makes it a natural choice for analyzing unstructured text. Despite this, the research community has shown that this algorithm alone lacks important features when it comes to NER. Thus, an "upgraded" version of it was used: a Bidirectional Long Short-Term Memory (BI-LSTM) with a Conditional Random Field (CRF) component on top of it.

In short, an LSTM consists of a Recurrent Neural Network to which a memory component has been added, allowing it to preserve long-term dependencies along its chain [12]. That said, an LSTM is unidirectional and, in order to accurately classify a token's label, the model must take into account the context of the token's neighborhood in both directions. Thus, two LSTMs are used, one responsible for the previous context and the other for the future context, creating a BI-LSTM. Finally, on top of this a CRF component is added, responsible for decoding the best tagging sequence and boosting the tagging accuracy [9]. Thus, a BI-LSTM-CRF is generated, the model that obtained state-of-the-art results on several NLP tasks in 2015 [7].
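As a rough illustration of this architecture, the sketch below builds the BI-LSTM backbone with the Keras API of TensorFlow. The vocabulary size, embedding dimension, tag set and layer sizes are placeholders rather than the values used in this work, and the CRF output layer is only indicated in a comment because its wiring depends on the implementation chosen (for instance, the one provided by TensorFlow Addons); the sketch ends in a per-token softmax instead.

```python
# Illustrative sketch of a BI-LSTM tagger in tf.keras; all hyper-parameters
# and the tag set are placeholders, not the values used in the paper.
import tensorflow as tf

VOCAB_SIZE = 20000  # assumed vocabulary size
EMBED_DIM = 100     # assumed embedding dimension
NUM_TAGS = 11       # e.g. B-/I- tags for People, Places, Dates, Professions,
                    # Organizations, plus O

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None,), dtype="int32"),       # integer-encoded tokens
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True),
    # Two LSTMs, one per direction, read the sentence left-to-right and
    # right-to-left so each token sees both past and future context.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True)),
    # A full BI-LSTM-CRF would replace this softmax with a CRF layer
    # (e.g. tfa.layers.CRF from TensorFlow Addons) to decode the best tag
    # sequence jointly; a plain softmax classifies each token independently.
    tf.keras.layers.Dense(NUM_TAGS, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training would then proceed with model.fit on padded, integer-encoded token sequences and their tag sequences, using the training/validation split described in the next section.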
4 Models' Results

One of the main objectives of this entity recognition work was to optimize the results obtained on a set of metrics: Precision, Recall and F1-Score. Thus, it was necessary to train new models so that the environment in which they are trained is as close as possible to the target context.

In order to train the ML models, it was necessary to create training data associated with the context of the archival documents. Thus, a set of national archive corpora was selected in order to begin annotating a representative fraction of each one. In total, 6302 sentences were annotated, comprising more than 160000 tokens. As for the entity counts, 17279 names of People, 6604 Places, 2980 Dates, 978 Professions or Titles and 843 Organizations were annotated, making a total of 28684 entities. After being annotated, each dataset was split into two parts, with 70% used for training and 30% for model validation.

With the validation and training data ready, the models were trained and then validated. During the training process, individual optimizations were performed for each tool in order to obtain the best possible results, for example, tuning hyper-parameters. Then the validation process started and, as can be seen in Table 1, very satisfactory results were obtained.

In this case, deep learning was clearly the winner in this NLP subfield. OpenNLP achieved the lowest results, with F1-score values between 69.80% and 99.71%, followed by spaCy, which achieved values between 75.98% and 99.94%, and finally TensorFlow with the BI-LSTM-CRF model, achieving values between 78.89% and 100%.

Corpus                           Tool        Precision (%)  Recall (%)  F1-Score (%)
IFIP                             OpenNLP     89.43          83.60       86.41
                                 spaCy       86.99          88.71       87.84
                                 TensorFlow  92.84          96.85       94.08
Família Araújo Azevedo           OpenNLP     81.94          63.67       71.66
                                 spaCy       75.19          76.78       75.98
                                 TensorFlow  78.22          82.47       78.89
Arquivo da Casa Avelar           OpenNLP     88.84          81.68       85.11
                                 spaCy       87.18          87.18       87.18
                                 TensorFlow  86.83          92.21       87.99
Inquirições de Genere 1          OpenNLP     99.60          99.53       99.57
                                 spaCy       98.31          96.74       97.52
                                 TensorFlow  100            100         100
Inquirições de Genere 2          OpenNLP     74.70          65.61       69.80
                                 spaCy       79.96          92.21       87.26
                                 TensorFlow  93.70          98.34       94.82
Paróquia do Jardim do Mar        OpenNLP     99.71          99.71       99.71
                                 spaCy       99.15          100         99.57
                                 TensorFlow  100            99.60       99.72
Paróquia do Curral das Freiras   OpenNLP     93.49          99.69       96.49
                                 spaCy       99.98          99.90       99.94
                                 TensorFlow  100            100         100

Table 1. Named Entity Recognition results.

5 Web Platform

Throughout this project, several tools were generated that facilitated and supported its development. In order to encourage research in this area of NLP applied to archival documents, all the produced material was made public through a Web platform.

This platform serves as a portfolio of the project, implementing several of the produced mechanisms with the main objective of allowing its users to take advantage of the ML models generated with the three tools, OpenNLP, spaCy and TensorFlow, enabling the execution of Named Entity Recognition on new unstructured text documents. The purpose of creating this platform is to make the created tools available to the community. It offers the following features:

– Enables users to perform Named Entity Recognition with three different ML statistical models.
– Enables sorting the results by entity type, alphabetical ordering and filtering of repeated entities.
– Supports the import of text files as input to the NER ML models.
– Exports the extracted entities in CSV and JSON file formats.
– Presents results from previous entity recognition runs so that it is possible to inspect real applications of each model on several different datasets.
– All annotated datasets are available for download in BIO format.
– Presents various dataset formats used in this subfield, such as CSV and BIO, providing parsers that allow the conversion of datasets between different formats (a minimal sketch of reading the BIO format is shown below).

It is important to mention that all ML models available on this platform were trained with archival documents, that is, they are expected to perform well in similar contexts. Accordingly, when using these models to recognize entities in documents of a different nature, poor results are to be expected.
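As an example of the kind of parser mentioned in the feature list above, the sketch below reads a BIO file into sentences and flattens it to CSV. It assumes one token and its tag per line, separated by whitespace, with blank lines between sentences; the exact column layout used by the platform's downloads may differ.

```python
# Minimal BIO reader and BIO-to-CSV converter; the two-column layout with
# blank-line sentence separators is an assumption, not the platform's spec.
import csv
from typing import List, Tuple

def read_bio(path: str) -> List[List[Tuple[str, str]]]:
    sentences, current = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:                     # blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            token, tag = line.split()[:2]
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

def bio_to_csv(sentences: List[List[Tuple[str, str]]], path: str) -> None:
    """Flatten the sentences into a simple CSV with sentence id, token and tag."""
    with open(path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["sentence", "token", "tag"])
        for i, sentence in enumerate(sentences):
            for token, tag in sentence:
                writer.writerow([i, token, tag])
```

A converter in the opposite direction (CSV back to BIO) would follow the same structure, which is essentially what the platform's format parsers expose.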
Finally, it is also interesting to mention that, by applying the ML statistical models to archival fonds, it was possible to perform an entity extraction that resulted in hundreds of thousands of extracted Named Entities. This result can be observed on the platform.

5.1 Implementation

The Web application was designed with a micro-services-based architecture and has two micro-services: the back end, developed in Node.js, Python and Java, and the front end, implemented in Vue.js and complemented with the Vuetify framework.

The back end server is responsible for receiving, processing and responding to HTTP requests. A Node.js server, complemented by the Express [3] library, manages all API routes and, when necessary, delegates the data processing to the corresponding tools. This happens, for example, with NER requests, which are processed by the ML models of OpenNLP, spaCy and TensorFlow. For this, Node.js's child process [2] library is used to create child processes that execute programs in Java and Python, wait for the output of their execution, and then forward the response to the client.

On the other hand, the front end was developed with Vue.js, a progressive JavaScript framework for creating reactive interfaces. This tool is focused on the view layer (client-side). It has a small learning curve, so it is fairly approachable, and it allows the creation of a performant and maintainable interface thanks to its reusable components mechanism, which isolates all logic from the views.

Finally, Docker images of the application were created for its deployment, so at the moment it is hosted on the servers of the Department of Informatics, University of Minho at [4].

6 Conclusion

As demonstrated in the validation of the ML models, this NER technique reveals great potential in this context, obtaining F1-score values greater than 80% on most of the tested corpora. It is also important to note that the algorithms that take advantage of Deep Learning obtained better results.

Furthermore, analyzing the results available on the Web platform, the trained models were able to extract hundreds of thousands of Named Entities from archival fonds by annotating only a small fraction of them and using that fraction to train the tools.

Thus, it is concluded that the use of ML tools to extract entities from archival documents is a viable approach, and it creates the opportunity to generate different navigation mechanisms and to create relations between information records.

7 Future Work

One way to improve the results obtained in entity recognition is to increase the amount of annotated data. In fact, training models with a larger dataset makes them able to perform in a wider variety of contexts. Another way to improve the models' results would be to improve the technologies used. The Attention Mechanism [19] has shown innovative results in this NLP subfield, so it would be interesting to test this technology in the archival context.

On the other hand, the extracted entities translate into valuable information about their corpus. These data can be explored to implement new tools, for example, Entity Linking mechanisms, enabling navigation between different documents through the relationships between entities.
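As a purely illustrative sketch of this idea (not something implemented in this work), an inverted index from recognized entities to the documents that mention them is already enough to make two finding aids navigable through a shared entity; the document identifiers and the NER output structure below are hypothetical.

```python
# Illustrative only: a naive entity-to-documents index built from NER output.
# `ner_output` stands in for whatever structure the trained models return.
from collections import defaultdict

ner_output = {
    "doc-001": [("Manuel de Araújo", "PER"), ("Braga", "LOC")],
    "doc-002": [("Manuel de Araújo", "PER"), ("Coimbra", "LOC")],
}

index = defaultdict(set)
for doc_id, entities in ner_output.items():
    for text, label in entities:
        index[(text, label)].add(doc_id)

# Documents linked through the person "Manuel de Araújo":
print(sorted(index[("Manuel de Araújo", "PER")]))   # ['doc-001', 'doc-002']
```

A real Entity Linking mechanism would additionally need to disambiguate and normalise entity mentions, which is precisely the kind of follow-up work suggested here.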
Finally, in order to complement the created Web platform, it would be interesting to use the trained ML models to create a support tool for unstructured text annotation, taking advantage of Active Learning techniques.

References

1. Arquivo Nacional Torre do Tombo, https://digitarq.arquivos.pt/, accessed in 18-04-2021
2. Node.js v16.4.0 documentation, https://nodejs.org/api/child_process.html, accessed in 17-03-2021
3. Node.js web application framework, https://expressjs.com/, accessed in 10-04-2021
4. Costa Cunha, L.F., Ramalho, J.C.: http://ner.epl.di.uminho.pt/
5. Freitas, C., Mota, C., Santos, D., Oliveira, H.G., Carvalho, P.: Second HAREM: Advancing the state of the art of named entity recognition in Portuguese. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10). European Language Resources Association (ELRA), Valletta, Malta (May 2010), http://www.lrec-conf.org/proceedings/lrec2010/pdf/412_Paper.pdf
6. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (2013). https://doi.org/10.1109/ICASSP.2013.6638947
7. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging (2015)
8. Ingersoll, G.S., Morton, T.S., Farris, A.L.: Taming Text: How to find, organize, and manipulate it. Manning, Shelter Island (2013), OCLC: ocn772977853
9. Ma, X., Hovy, E.: End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In: 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016 - Long Papers (2016). https://doi.org/10.18653/v1/p16-1101
10. Arquivo Regional e Biblioteca Pública da Madeira: https://arquivo-abm.madeira.gov.pt/, accessed in 10-03-2021
11. Maxent, O.: The maximum entropy framework (2008), http://maxent.sourceforge.net/about.html, accessed in 24-09-2020
12. Olah, C.: Understanding LSTM networks (August 2015), http://colah.github.io/posts/2015-08-Understanding-LSTMs/, accessed in 10-03-2021
13. Oliveira, J.N.: Hitex: Um sistema em desenvolvimento para historiadores e arquivistas. Forum (1992)
14. OpenNLP, A.: Welcome to Apache OpenNLP (2017), https://opennlp.apache.org/, accessed in 18-10-2020
15. Pires, A.R.O.: Named entity extraction from Portuguese web text. Master's thesis, Faculdade de Engenharia da Universidade do Porto (2017)
16. Rodrigues, A.M., Guimarães, C., Barbedo, F., Santos, G., Runa, L., Penteado, P.: Orientações para a descrição arquivística (May 2011), https://act.fct.pt/acervodocumental/documentos-tecnicos-e-normativos/
17. spaCy: spaCy 101: Everything you need to know, https://spacy.io/usage/spacy-101, accessed in 07-01-2021
18. spaCy: Model architecture (2017), https://spacy.io/models, accessed in 14-01-2021
19. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. vol. 2017-December (2017)