<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Named Entity Recognition Applied to Portuguese Texts from the XVIII Century</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CIDEHUS, Universidade de E</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Federal University of Rio Grande do Sul</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Surrey</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Extracting data and knowledge dispersed across old Portuguese medical records is important, especially for researchers working in historical epidemiology and the health sciences. Named Entity Recognition (NER) is an essential Natural Language Processing task for handling textual information. In this paper, our main objective is to test the performance of NER systems for Portuguese when extracting information from XVIII-century medical texts, so that we can provide an annotated version of an important work of this type.</p>
      </abstract>
      <kwd-group>
        <kwd>NER</kwd>
        <kwd>XVIII-Century Portuguese</kwd>
        <kwd>Historical Medicine</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Alongside recent advances in Deep Learning, Machine Learning and Neural
Models, text mining techniques offer an important input for extracting
knowledge from large collections. These techniques can help the work of philology
researchers, historians and physicians who deal with the history of Medicine
and Epidemiology. They provide effective ways to obtain, integrate and
interpret data collected from different sources, and they can reduce the time
required to process information from vast textual material in a non-linear approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Extracting data and knowledge dispersed among old medical records is
important, especially for researchers dealing with historical epidemiology (HE)
and health sciences [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. HE holds the promise of creating a more robust and
nuanced foundation for global public health decision-making by deepening the
empirical records from which we draw lessons about past interventions. Several
of these interventions are narrated, for example, in medical manuals published
in Portuguese in the XVIII century. However, facing the complexity of old texts
and understanding the information they bring is not a trivial task.
      </p>
      <p>
        Within the scope of Information Science and the data systematization of
dispersed texts [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the gap in working with old documents and collections is also
recognized as a challenge. Although there is great interest in the consideration of
temporal aspects associated with Information Retrieval (IR), Schiel [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] states
that he had not found any work focused on determining the temporal context
of a concept and its correlations according to the system of the time involved or
even relating it to the frameworks of current knowledge.
        <fn><label>⋆</label><p>Copyright © 2022 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p></fn>
      </p>
      <p>
        To process and represent this kind of textual information, an essential task
in Natural Language Processing (NLP) is Named Entity Recognition (NER).
It corresponds to the recognition and categorization of entities mentioned in a
text sample or corpus. Examples of named entities are proper names, events,
places, temporal and quantitative data, etc. These extracted entities can then
be further mapped to a knowledge base (KB), in particular a Digital Humanities
KB (DHKB) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These DHKBs allow for the identification and combination of
existing knowledge about historical facts from different sources.
      </p>
      <p>In this paper, our main objective is to identify a NER system for Portuguese
that would work for extracting information from XVIII-century medical texts<sup>4</sup>.
To this end, we first create a gold standard based on modernized
transcriptions of three text samples and then evaluate the systems’ named entity (NE)
extraction against this gold standard. As a second step, we contrast the
systems’ NE extraction performed on the modernized transcriptions against their
own NE extraction from non-modernized transcriptions, which present
different spelling and syntax. After identifying the best NER system based on these
two experiments, our next objective is to conduct a full NE extraction from an
XVIII-century medical corpus.</p>
      <p>The remainder of this paper is divided as follows: Section 2 presents previous
studies related to NER and old texts; Section 3 describes the corpora, the gold
standard, and the three NER systems that we used, while also explaining the
procedures for two experiments; Section 4 contains the results of the experiments,
presenting a quantitative and qualitative analysis; finally, in Section 5, we recap
the main contributions of this paper and discuss future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Quaresma &amp; Finatto [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed a set of initial experiments involving NER
using the Spanish medical handbook Observaciones de Curvo, written by
Francisco Suarez de Ribera in 1735, which is based on Curvo Semedo’s 1707
Portuguese work Observaçoens medicas doutrinaes de cem casos gravissimos. The
processing steps were done by applying NLP tools without any human
intervention, from the OCR output to the creation of an ontology. Considering a sample
of 10% of the extracted NEs, the authors identified a precision of 21% for
locations, 22% for persons, and 5% for events. The authors also report that most of
the errors occurred because of the low quality of the OCR result.
      </p>
      <p>
        Regarding previous work on NER applied to old Portuguese, we have the
work done with the Parish Memories. A digital version of these manually
transcribed texts is freely available through the CIDEHUS Digital Portal<sup>5</sup>. From
this collection a named entity dataset was automatically built using machine
learning and language models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The initial entity categories considered were
person, location, and organization. This resource was made available to the
community, where texts are given with their respective lists of named entities [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. It
was based on a completely automated process, with no new training performed and
a reported accuracy of about 50%.
        <fn><label>4</label><p>An ongoing corpus composed of medical manuals is available on the
Historical Terminology section of the Textecc project: http://www.ufrgs.br/textecc/terminologia/.</p></fn>
      </p>
      <p>
        There are studies involving NER in other historical languages. Hubková et al.
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] present a study for NER in a Czech historical corpus. They developed a new
annotated dataset for historical NER, composed of historical newspapers, and
conducted experiments using recurrent neural networks, achieving performance
around 70%. For medieval French, Aguilar and Stutzmann [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] present a corpus
and train a new system for legal documents of the XIII and XIV centuries.
Their performance measures are around 90%.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>In this section, we describe our corpus. We then go through the process of
creating a gold standard from a manual annotation of NEs, we describe the process
of using off-the-shelf NLP tools to recognize NEs, and finally we explain our
evaluation approach.</p>
      <sec id="sec-3-1">
        <title>Corpus</title>
        <p>In order to test NER in a broader set of historical texts, we collected three
Portuguese text samples from the same period, written by different authors,
from different domains, and using different writing styles.</p>
        <p>
          The first sample is from a medical handbook written by João Curvo Semedo
(1635-1719), a Portuguese physician from Monforte, in 1707: Observaçoens
medicas doutrinaes de cem casos gravissimos [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This handbook presents
observations made by Semedo related to diagnosis and treatment of a hundred severe
cases. It contains rich information about the medical terminology existing at
that time, including names for treatments and diseases. Here is an extract from
Semedo’s handbook which preserves the original spelling:
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>OBSERVAÇAM XLV.</title>
        <p>De hum mercador , a quem repentinamente assaltou huma dor de colica
tão intoleravel , que estando na ef´ sacramental para commungar , o não
pode fazer ; e sendo eu chamado , conheci dos grandissimos ardores ;
e continuos desejos de ourinar , e vomitar , das picadas da bexiga , e
do adormecimento da perna direita , que a tal dor era nephritica ; para
cujo remedio appliquei hum vomitorio de tres onças de agua benedicta
vigorada , e tres ajudas feitas de cozimento de rim de vacca , [...]</p>
        <sec id="sec-3-2-1">
          <p><fn><label>5</label><p>http://www.cidehusdigital.uevora.pt</p></fn></p>
          <p>
            The second and third text samples come from the Gazetas Manuscritas da Biblioteca
Pública de Évora<sup>6</sup> [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and from the Sermons written by Fr. Vieira, respectively. The Gazetas
is a large corpus of journalistic texts from the XVIII century<sup>7</sup>. The excerpt below
contains a transcription that used modernized spelling:
          </p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Diário de 23 de agosto de 1729</title>
        <p>Pelas cartas de Vasco Fernandes César, se soube a notícia, que aqui todos
ignoravam, de que El-rei o tinha feito Conde de Sabugosa vila junto a
Viseu de que não sabemos se lhe desse senhorio.</p>
        <p>Chegou Rodrigo César, gordo, mas não cheio, mostrou grande
desinteresse; as minas que descobriu tem grande quantidade de ouro, e se achou
um grão de meia arroba, porém é mau o clima, [...]</p>
        <p>
          Fr. Antônio Vieira (1608-1697) was born in Portugal, and his works are known
to this day for the complexity of the text and the sophistication of the
argumentation. One collection of his sermons is partially available, in a transcribed
version, at the Tycho Brahe Corpus [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Here is an excerpt from the beginning
of the sermon that we used in this study, in a partially modernized spelling:
SERMÃO | da | Primeira Dominga do Advento
Prégado na Capella Real, no Anno de 1652 | Amen dico vobis, non
praeteribit | generatio haec donec omnia fiant. | Lucas, XXI | I
Muitas coisas sabemos deste grande dia, todas grandes e temerosas, e
duas só ignoramos. Sabemos que antes do dia do Juízo, o sol, que soía
fazer o dia, se há-de escurecer e esconder totalmente com o mais horrendo
e assombroso eclipse que nunca viram os mortaes. [...]
        </p>
        <p>These three text samples have approximately the same size (around 2000
words each). We used both modernized and non-modernized transcriptions as the basis
for this study.</p>
        <p>We also have 90 transcribed observations available out of the 101 total
observations present in Semedo’s medical handbook, and we use this corpus for
extracting NEs at the end of this study.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Gold standard</title>
        <p>The first step for testing how NER tools perform in historical texts was to create
a gold standard using our existing samples. The gold standard was generated
using the modernized versions of our three text samples. Two linguists, authors
of this paper, annotated all samples independently and exhaustively.</p>
        <sec id="sec-3-4-1">
          <p><fn><label>6</label><p>From now on just referred to as the Gazetas, for short.</p></fn><fn><label>7</label><p>The extract used in this study comes from the period between 1729 and 1731.</p></fn></p>
          <p>After the first round of annotation, we compared both lists and agreed upon
the entries that would go into the gold standard. This final list has a total of 262
NEs. Table 1 shows the tags in the gold standard, along with a few examples<sup>8</sup>.</p>
          <p>In this section, we briefly describe the three NER models that we used. Two
pre-trained models were taken from spaCy’s<sup>9</sup> NER libraries, and the third one
is a BERT-CRF model trained for Portuguese.</p>
          <p>
            The pre-trained NER models for Portuguese offered by spaCy come with
different language model (LM) sizes. For this study, we selected both the large and
small LMs (spaCy lg and spaCy sm, respectively). These models were trained
on the WikiNER annotation [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], which does not contain all tags present in our
gold standard<sup>10</sup>. When contrasting the annotations of the spaCy models with
our gold standard, we considered that EVENT, TIME, WORK, and VALUE
instances recognized as MISCELLANEOUS were correct.
          </p>
          <p>
            Souza et al. [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] trained a BERT-CRF model (i.e. a BERT-based embedding
model associated with a Conditional Random Fields layer) based on
BERTimbau [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], which is a BERT-based embeddings model for Portuguese. BERT-CRF
used the HAREM corpus [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] for NER training. HAREM contains a series of NE
tags: PERSON, LOCATION, ORGANIZATION, TIME, VALUE,
ABSTRACTION, EVENT, THING, WORK, OTHER. When contrasting our gold standard
with the extraction made with BERT-CRF, we mapped ABSTRACTION and
THING to OTHER.
          </p>
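          <p>The tag reconciliation described above can be illustrated with a minimal sketch. It assumes the tag names exactly as written in this section; the function names, the system identifiers and the strict reading of the two rules are hypothetical and are not taken from the authors' code.</p>
          <preformat>
```python
# Hypothetical sketch of the tag reconciliation rules described in the text.
# All names below are illustrative; they do not come from the paper's repository.

# Gold-standard tags that the spaCy models cannot emit directly. A
# MISCELLANEOUS prediction is accepted for any of them.
SPACY_MISC_EQUIVALENTS = {"EVENT", "TIME", "WORK", "VALUE"}

# HAREM tags produced by BERT-CRF that are folded into OTHER before comparison.
BERT_CRF_TO_OTHER = {"ABSTRACTION", "THING"}


def normalize_bert_crf_tag(tag):
    """Map ABSTRACTION and THING to OTHER; leave every other tag unchanged."""
    return "OTHER" if tag in BERT_CRF_TO_OTHER else tag


def tags_agree(system, predicted_tag, gold_tag):
    """Return True when a predicted tag counts as correct against the gold tag."""
    if system == "bert-crf":
        return normalize_bert_crf_tag(predicted_tag) == gold_tag
    if system in ("spacy_lg", "spacy_sm"):
        if predicted_tag == "MISCELLANEOUS":
            return gold_tag in SPACY_MISC_EQUIVALENTS
        return predicted_tag == gold_tag
    raise ValueError("unknown system: " + system)


if __name__ == "__main__":
    print(tags_agree("spacy_lg", "MISCELLANEOUS", "TIME"))
    print(tags_agree("bert-crf", "THING", "OTHER"))
```
          </preformat>
          <p>Keeping the two rules in separate lookup sets makes it easy to see which concessions were made to each system when reading the scores.</p>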
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>Evaluation approach</title>
        <p>For the evaluation shown in the next section, we used a glossary of the NEs that
considers only the surface form and frequency in each of the text samples. This
means that we analyzed whether the systems were able to extract the correct
NEs from the samples, without taking the source segments into consideration.
        <fn><label>8</label><p>The complete list contains additional annotations with tags that were not taken into
consideration in this study (e.g., MEASUREMENT, SYMPTOM). The complete
annotation is available at
https://github.com/uebelsetzer/NER_for_Portuguese_XVIII-Century_Texts/tree/main/gold</p></fn>
        <fn><label>9</label><p>https://spacy.io/api/entityrecognizer</p></fn>
        <fn><label>10</label><p>SpaCy uses PERSON, LOCATION, ORGANIZATION and MISCELLANEOUS.</p></fn></p>
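        <p>A glossary-based comparison of this kind could be sketched as follows. This is only an illustration under stated assumptions: each extraction is reduced to its surface form, tags are ignored for brevity, and the function names and sample entities are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def build_glossary(entities):
    """Count how often each surface form occurs in one text sample."""
    return Counter(entities)


def compare_glossaries(system_entities, gold_entities):
    """Compare surface forms and frequencies, ignoring positions in the text."""
    system = build_glossary(system_entities)
    gold = build_glossary(gold_entities)
    matched = Counter()
    for form, gold_count in gold.items():
        # A form is matched as many times as it occurs in both glossaries.
        shared = min(gold_count, system.get(form, 0))
        if shared:
            matched[form] = shared
    extra = system - matched    # extracted by the system, absent from the gold list
    missing = gold - matched    # present in the gold list, not extracted
    return matched, extra, missing


if __name__ == "__main__":
    # Hypothetical entities, for illustration only.
    gold = ["Lisboa", "Lisboa", "Vasco Fernandes César"]
    system = ["Lisboa", "Vasco Fernandes César", "Chegou"]
    matched, extra, missing = compare_glossaries(system, gold)
    print(dict(matched), dict(extra), dict(missing))
```
        </preformat>
        <p>Because only forms and counts are compared, two extractions of the same name from different sentences are interchangeable, which mirrors the decision not to take source segments into account.</p>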
        <p>We first compared the systems’ performance on the modernized versions of
the samples against the gold standard. Then, as a second verification, we made
an intra-system comparison between the extraction from the modernized and the
non-modernized samples (see Section 3.1). Finally, we selected the best system
to extract NEs from the larger corpus of Semedo’s work.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this section we present the results of a quantitative and qualitative evaluation
of the named entity recognition (NER) performed by the tools. At the end we
describe the annotation of NEs based on the main corpus.</p>
      <sec id="sec-4-1">
        <title>NER on modernized samples</title>
        <p>Looking at the annotations generated by each system, it becomes clear that
BERT-CRF had better precision in the annotation of NE tags. Some recurrent
mistakes of this system were the annotation of locations as persons and also the
annotation of any digit as a value. It also missed some values when the number
was written in its expanded form (e.g., “150.000 réis” was correctly recognized,
but “um conto de réis” [1.000 réis] was not). Regarding both spaCy models, the
main errors were concentrated in the incorrect annotation of tokens as NEs, and also in
the annotation of persons as locations (the opposite of the BERT-CRF model).</p>
        <p>When looking at the annotations across the three text samples, it became
clear that the sermon by Fr. Vieira was the most complex for the spaCy systems
to tag properly, as many persons were wrongly tagged (16 for
spaCy lg and 20 for spaCy sm out of 39). However, Semedo’s text was the
most complex in terms of proper recognition. The whole text had only 14 NEs
in the gold standard, and BERT-CRF incorrectly identified 6 extra NEs, while
spaCy lg got 14 extra NEs, and spaCy sm got 18 extra NEs, which is more than
the number of correct NEs existing in the text. In the Gazetas, which had a total
of 171 NEs, all systems worked fairly well, but BERT-CRF again showed the
best performance, with 143 fully correct NEs, 14 wrong tags, and only 4 missing
NEs. The spaCy lg model had 127 correctly annotated NEs, 17 wrong tags, and
14 missing annotations; while spaCy sm had the worst result, with 105 correct
annotations and 42 wrong tags.</p>
        <p>Another error that was common for the spaCy models (especially spaCy sm)
was the tagging of extra tokens preceding or following a NE. When there was
a capitalized word in the proximity, it was often considered as part of the NE,
leading to the extraction of NEs such as “Batizou Francisco de Almada”
[Baptized Francisco de Almada] and “O Marquês de Marialva” [The Marquess of
Marialva]<sup>11</sup>. This was not a problem at all for BERT-CRF.</p>
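        <p>One way to detect this kind of over-extended span is sketched below. It assumes that the gold standard holds the bare names (for instance "Francisco de Almada") and that an extraction counts as partial when it contains a gold NE plus extra tokens; the function name and the label strings are hypothetical, and the sketch does not cover other kinds of partial overlap.</p>
        <preformat>
```python
def classify_extraction(extracted, gold_entities):
    """Label an extracted span as 'full', 'partial', or 'extra'."""
    if extracted in gold_entities:
        return "full"
    extracted_tokens = extracted.split()
    for gold in gold_entities:
        gold_tokens = gold.split()
        width = len(gold_tokens)
        # Slide a window over the extracted tokens, looking for the gold NE
        # surrounded by additional tokens (e.g. a capitalized verb or article).
        for start in range(len(extracted_tokens) - width + 1):
            if extracted_tokens[start:start + width] == gold_tokens:
                return "partial"
    return "extra"


if __name__ == "__main__":
    # Assumed gold forms, based on the examples quoted in the text.
    gold = {"Francisco de Almada", "Marquês de Marialva"}
    for span in ("Batizou Francisco de Almada", "O Marquês de Marialva", "Chegou"):
        print(span, classify_extraction(span, gold))
```
        </preformat>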
        <p>BERT-CRF had the best result when analyzing the extraction from a
modernized version of the texts. However, since the process of modernizing these texts
is similar to the process of a translation, it is unrealistic to expect all historical
texts to be translated before applying a NER system to them. So we cannot use
an NE extraction based on modernized versions as a parameter for old texts.
As such, we still had to see how NER systems would work in a non-modernized
version. This is what we explore in the next section.</p>
      </sec>
      <sec id="sec-4-2">
        <title>NER systems: modernized vs. non-modernized extraction</title>
        <p>In this section, we describe the results of our intra-system comparison. This was
done to check how the results differed when the sample text varied, and the
non-modernized transcription was used for the extraction of NEs.</p>
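        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>System</th><th>Full match</th><th>Different tag</th><th>Partial match</th><th>Extra instances</th><th>Missing instances</th></tr>
            </thead>
            <tbody>
              <tr><td>BERT-CRF</td><td>221 (83.71%)</td><td>24 (9.09%)</td><td>13 (4.92%)</td><td>6 (2.27%)</td><td>14</td></tr>
              <tr><td>spaCy lg</td><td>189 (53.54%)</td><td>32 (9.07%)</td><td>24 (6.80%)</td><td>108 (30.59%)</td><td>15</td></tr>
              <tr><td>spaCy sm</td><td>195 (63.73%)</td><td>45 (14.71%)</td><td>18 (5.88%)</td><td>48 (15.69%)</td><td>17</td></tr>
            </tbody>
          </table>
        </table-wrap>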
        <p>Table 3 shows how the systems compare among themselves when run on
modernized and on non-modernized versions of the same text. The percentages
in brackets are a direct comparison between the annotated NEs in the
non-modernized versions against the ones annotated in the modernized version (e.g.
the 221 full-match instances were present in both BERT-CRF annotations with
the same tag, albeit with different spellings, and they account for 83.71% of
the annotations in the non-modernized version). The missing annotations are the
NEs in the modernized version that were not annotated in the non-modernized
version. Although we did not judge here whether the extracted NEs themselves were
correct, it was possible to see some interesting annotations.
For instance, in the results from BERT-CRF, the annotation of the
non-modernized version produced a few more accurate results (e.g. “senhora
Condessa da Atalaja D . Francisca” was only a partial match, and “senhora Condessa de
Arcos” was not present, in the annotation of the modernized text). For the spaCy
models, however, we see that too many extra instances were added to the
annotation, and many of these extra instances were wrong, while there was a larger
number of missing instances.
        <fn><label>11</label><p>In the evaluation, these were considered as partial annotations.</p></fn></p>
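        <p>The percentages reported for the intra-system comparison can be recomputed from the raw counts. The sketch below assumes that each share is taken over the sum of the full-match, different-tag, partial-match and extra instances (that is, over all annotations in the non-modernized version, with missing instances left out); this reading reproduces the published figures, but the dictionary layout and function name are hypothetical.</p>
        <preformat>
```python
# Counts from the intra-system comparison; missing instances are not part of
# the denominator under the assumption stated in the lead-in.
COUNTS = {
    "BERT-CRF": {"full": 221, "different_tag": 24, "partial": 13, "extra": 6},
    "spaCy lg": {"full": 189, "different_tag": 32, "partial": 24, "extra": 108},
    "spaCy sm": {"full": 195, "different_tag": 45, "partial": 18, "extra": 48},
}


def shares(counts):
    """Return each category as a percentage of all non-modernized annotations."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


if __name__ == "__main__":
    for system, counts in COUNTS.items():
        print(system, sum(counts.values()), shares(counts))
```
        </preformat>
        <p>For BERT-CRF, for example, the denominator is 221 + 24 + 13 + 6 = 264 annotations, and 221 / 264 gives the 83.71% full-match share.</p>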
      </sec>
      <sec id="sec-4-3">
        <title>Annotation of Semedo’s Work</title>
        <p>From a qualitative point of view, the annotation on the non-modernized
transcriptions was not as good as in the modernized versions, which was expected.
In the non-modernized versions, there are spelling issues that introduce noise
in the results. However, even in such an adverse context, BERT-CRF annotations
were still consistent, and it proved to be the most robust of the three models,
as it was able to handle one of the main issues of working with old texts: the
non-standard spelling.</p>
        <p>
          Following our objective of retrieving information from historical medical
texts, we automatically annotated NEs in Semedo’s Observaçoens medicas
doutrinaes de cem casos gravissimos [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The annotation contains the following
distribution of tags (number of unique forms in brackets): ABSTRACTION: 335 (135);
EVENT: 1 (1); LOCATION: 291 (184); ORGANIZATION: 31 (26); OTHER: 18
(13); PERSON: 1326 (692); THING: 110 (54); TIME: 71 (62); VALUE: 293 (70);
WORK: 30 (24); total: 2506 (1261)<sup>12</sup>.
        </p>
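        <p>A distribution like the one above (occurrences per tag, with unique surface forms in brackets) could be tallied as in the following sketch. It assumes the annotation is available as a list of (surface form, tag) pairs; that representation and the function name are hypothetical. The reported per-tag figures are included only to check that they add up to the stated totals.</p>
        <preformat>
```python
from collections import Counter, defaultdict

# Per-tag figures as reported in the text: (occurrences, unique forms).
REPORTED = {
    "ABSTRACTION": (335, 135), "EVENT": (1, 1), "LOCATION": (291, 184),
    "ORGANIZATION": (31, 26), "OTHER": (18, 13), "PERSON": (1326, 692),
    "THING": (110, 54), "TIME": (71, 62), "VALUE": (293, 70), "WORK": (30, 24),
}


def tag_distribution(annotations):
    """Return {tag: (occurrences, unique surface forms)} for (form, tag) pairs."""
    occurrences = Counter()
    unique_forms = defaultdict(set)
    for form, tag in annotations:
        occurrences[tag] += 1
        unique_forms[tag].add(form)
    return {tag: (occurrences[tag], len(unique_forms[tag])) for tag in occurrences}


def totals(distribution):
    """Sum occurrences and unique forms over all tags."""
    pairs = list(distribution.values())
    return sum(pair[0] for pair in pairs), sum(pair[1] for pair in pairs)


if __name__ == "__main__":
    # Hypothetical annotations, for illustration only.
    sample = [("Lisboa", "LOCATION"), ("Lisboa", "LOCATION"), ("Galeno", "PERSON")]
    print(tag_distribution(sample))
    print(totals(REPORTED))
```
        </preformat>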
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Final Remarks</title>
      <p>We compared the performance of three off-the-shelf NER systems in Portuguese
texts written in the XVII and XVIII centuries. We collected three modernized
text samples from that period, annotated them with NE tags to create a gold
standard, and evaluated the extractions of the three systems against this gold
standard. We also compared the extractions from the modernized versions of the
samples against their non-modernized versions to see how much the differences
in spelling and formatting would interfere with the systems’ performance.
      <fn><label>12</label><p>This annotation is readily available at
https://github.com/uebelsetzer/NER_for_Portuguese_XVIII-Century_Texts.</p></fn></p>
      <p>After analyzing the results of both experiments, we concluded that
BERT-CRF had better performance, even when considering the original spelling of the
historical texts. Both spaCy models had issues in recognizing NEs, changing the
tag of many entities and adding wrong NEs, especially in the non-modernized
versions of the texts. Considering these results, we used BERT-CRF to
annotate a large sample of non-modernized texts extracted from Semedo’s work
Observaçoens medicas doutrinaes de cem casos gravissimos.</p>
      <p>In future research, we intend to annotate this corpus with other types of
entities that are relevant for understanding the medical practices of the
XVII and XVIII centuries. Our plan is then to train a system to identify these entities and
extract more information from similar texts from that period.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the following institutions for providing funding
for this research: Expanding Excellence in England (E3) Fund, promoted by
Research England; CNPq and FAPERGS - Brazil (FAPERGS - CAPES - 06/2018
internacionalização - proc. 19/2551-0000718-3; CNPq 06/2019 - Productivity in
research – proc. 308926/2019-6); and the Portuguese Foundation for Science and
Technology (FCT), projects CEECIND/01997/2017 and UIDB/00057/2020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aguilar</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stutzmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Named entity recognition for french medieval charters</article-title>
          .
          <source>In: Workshop on Natural Language Processing for Digital Humanities</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Golub</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.H.</given-names>
          </string-name>
          :
          <article-title>Information and knowledge organisation in digital humanities: Global perspectives (</article-title>
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Higuchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cuconato</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rademaker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Text mining for history: first steps on building a large dataset</article-title>
          .
          <source>In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hubková</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kral</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettersson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Czech historical named entity corpus v 1.0</article-title>
          .
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . pp.
          <fpage>4458</fpage>
          -
          <lpage>4465</lpage>
          . European Language Resources Association, Marseille, France (May
          <year>2020</year>
          ), https://www.aclweb.org/anthology/2020.lrec-1.549
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lisboa</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>dos Reis Miranda</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olival</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Gazetas Manuscritas da Biblioteca Pública de Évora</article-title>
          . Vol. 1 (1729-1731).
          <source>Publicações do Cidehus</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Nothman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ringland</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curran</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          :
          <article-title>Learning multilingual named entity recognition from Wikipedia</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>194</volume>
          ,
          <fpage>151</fpage>
          -
          <lpage>175</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finatto</surname>
            ,
            <given-names>M.J.B.</given-names>
          </string-name>
          :
          <article-title>Information extraction from historical texts: a case study</article-title>
          .
          <source>In: DHandNLP@ PROPOR</source>
          . pp.
          <fpage>49</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seco</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardoso</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vilela</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>HAREM: An advanced NER evaluation contest for Portuguese</article-title>
          . In: Nicoletta Calzolari; Khalid Choukri; Aldo Gangemi; Bente Maegaard; Joseph Mariani; Jan Odijk; Daniel Tapias (eds.)
          <source>Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'2006), Genoa, Italy, 22-28 May 2006</source>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Consoli</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>dos Santos</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Terra</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collovini</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vieira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Assessing the impact of contextual embeddings for Portuguese named entity recognition</article-title>
          . In:
          <source>2019 8th Brazilian Conference on Intelligent Systems (BRACIS)</source>
          . pp.
          <fpage>437</fpage>
          -
          <lpage>442</lpage>
          . IEEE (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Schiel</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Texto &amp; contexto: por uma recuperação da informação com mais semântica</article-title>
          .
          <source>Ciência da Informação 50(2)</source>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Semmedo</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Observaçoens medicas doutrinaes de cem casos gravissimos (1707)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>de Sousa</surname>
            ,
            <given-names>M.C.P.</given-names>
          </string-name>
          :
          <article-title>O Corpus Tycho Brahe: contribuições para as humanidades digitais no Brasil</article-title>
          .
          <source>Filologia e Linguística Portuguesa</source>
          <volume>16</volume>
          (esp.),
          <fpage>53</fpage>
          -
          <lpage>93</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotufo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Portuguese named entity recognition using BERT-CRF</article-title>
          . arXiv preprint arXiv:1909.10649 (
          <year>2019</year>
          ), http://arxiv.org/abs/1909.10649
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotufo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>BERTimbau: pretrained BERT models for Brazilian Portuguese</article-title>
          . In:
          <source>9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23</source>
          (to appear) (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vieira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olival</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cameron</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sequeira</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Enriching the 1758 Portuguese parish memories (Alentejo) with named entities</article-title>
          .
          <source>Journal of Open Humanities Data</source>
          <volume>7</volume>
          ,
          <issue>20</issue>
          (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Historical epidemiology and global health history</article-title>
          .
          <source>História, Ciências, Saúde-Manguinhos</source>
          <volume>27</volume>
          ,
          <fpage>13</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>