Named entity annotation of an 18th-century transcribed corpus: problems and challenges¹

Helena Freire Cameron¹ [0000-0001-7719-6894], Fernanda Olival² [0000-0003-4762-3451], Renata Vieira² [0000-0003-2449-5477], and Joaquim Santos³ [0000-0002-0581-4092]

¹ CIDEHUS, Instituto Politécnico de Portalegre, helenac@ipportalegre.pt
² CIDEHUS, Universidade de Évora, mfo@uevora.pt, renatav@uevora.pt
³ Escola Politécnica, PUCRS, joaquimneto04@gmail.com

Abstract. This paper reviews one stage of the process of annotating named entities in 18th-century texts, with the aim of enriching historical research sources and linking them to other databases. The categories in question are person, location and organisation, all of which are relevant for historical analysis. We discuss the difficulties observed in the process and point to possible solutions.

Keywords: Named entity, corpus annotation, 18th-century Portuguese.

1 Introduction

This paper describes one stage of the process of annotating named entities in 18th-century texts from a transcribed corpus, Memórias Paroquiais [Parish Memories]. We studied a subcorpus covering Alentejo, the largest region in the south of mainland Portugal (PM_A). These texts are the answers to a survey sent in January 1758 to the bishops, who were asked to forward it to the parish priests for a response. The survey had two aims: 1) to obtain feedback about the state of the territory after the great earthquake of 1755; 2) to gather information for a Geographical Dictionary of Portugal. The inquiry, identical for the whole country, had 60 questions organised under three headings: land, mountain and river. It asked about almost every aspect of local life and finished with an open question about other pertinent topics of the parish not covered by the previous questions.

We began working on the Parish Memories in 2007-2008, using the digitised copies of the microfilms available on the website of the Portuguese National Archive of Torre do Tombo (Figure 1).
At first, our only goal was to transcribe them and post the results on the web. Next, we tried to make the texts more easily searchable by introducing a manual system of tags and making it available online in the CIDEHUS Digital Web Portal². Later on (2020), other possibilities opened up; annotating named entities was one way to advance towards information extraction. We aimed to organise the information contained in the corpus and also to make it available as open data and linked open data. Our first goal is to link data from the Parish Memories to data from the book Corografia Portuguesa (Costa, 1712), also online in CIDEHUS Digital³.

As the annotation proceeded, many questions arose. We systematise some of them in the present paper in order to inventory and discuss the difficulties of annotating 18th-century Portuguese historical texts and to reflect on the next annotation phase.

Fig. 1. Sample of the digital version of the Parish Memories, available on the Torre do Tombo website⁴.

¹ Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This work is funded by national funds through the Foundation for Science and Technology, under the project UIDB/00057/2020.
² http://www.cidehusdigital.uevora.pt/portugal1758
³ http://www.cidehusdigital.uevora.pt/ophir-restaurada/corografia
⁴ https://digitarq.arquivos.pt/details?id=4238720

2 Named Entity manual annotation

Our annotation followed the categories most widely used in NER work: person, location and organisation. These categories are also of significant importance for historical research, since historical events are often linked to agents, places and institutions. Many systems have been developed to recognise these categories, so we selected them for our first annotation attempts on the 18th-century texts.
We followed guidelines previously proposed by the HAREM project (Santos and Cardoso, 2007), a named entity recognition contest based on contemporary Portuguese. However, we have not yet distinguished different types of mentions. In the HAREM subtypes, a person mention was further marked as referring to an individual by name or through other singular attributes (like occupation), as in Elisabeth II or the Queen of the United Kingdom, respectively. Capital letters constituted a criterion to identify named entities. Although the use of capital letters is not systematic in PM_A (they were used randomly in common nouns but correctly in the majority of named entities), we followed this directive.

Two historians manually annotated a representative sample of the Memories, using the INCEpTION tool (Klie et al., 2018). This annotation platform supports semantic annotation (e.g., concept linking, knowledge base population), features that are essential for History researchers. The annotation interface is presented in Figure 2.

Fig. 2. INCEpTION interface for named entity annotation.

Inter-annotator agreement was measured, resulting in Kappa = 0.71, which is considered substantial agreement (McHugh, 2012). Table 1 shows the number of annotated examples for each class.

Table 1. Number of named entities manually annotated

Category   Annot1   Annot2
PER        491      474
LOC        693      514
ORG        444      368
Total      1628     1356

Fig. 3. Examples of named entities

In Figure 3, we show some examples of each category occurring in the Parish Memories. The text also illustrates some spelling differences between that period and contemporary Portuguese (naquellas, espeçialmente, estavâo, villa, fes). Portuguese spelling was not yet standardised in the 18th century, and many allographs can occur in the same document. There were rules, but nobody was obliged to follow them (Cardeira, 2006; Cardeira & Mateus, 2008; Marquilhas, 2000).
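The practical effect of these allographs on any automatic lookup can be illustrated with a small sketch. It is only an illustration under stated assumptions: the rewrite rules, the function names `match_key` and `group_variants`, and the modern forms in the sample list are ours, not part of the annotation workflow described in this paper. The rules build a crude matching key under which variants coincide; they do not produce a modernised spelling.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical rewrite rules, chosen only to cover variants quoted in this paper.
# They build a crude matching key, not a modernised spelling.
_RULES = [
    (re.compile(r"ç"), "s"),          # Çamora / Samora
    (re.compile(r"c(?=[ei])"), "s"),  # espeçialmente / especialmente
    (re.compile(r"z"), "s"),          # fes / fez, Rozario / Rosário
    (re.compile(r"ll"), "l"),         # villa / vila, naquellas / naquelas
]


def match_key(form: str) -> str:
    """Collapse a written form to a key under which its variants coincide."""
    key = unicodedata.normalize("NFC", form).lower()
    for pattern, replacement in _RULES:
        key = pattern.sub(replacement, key)
    # Drop remaining diacritics (e.g. Rosário -> rosario).
    decomposed = unicodedata.normalize("NFD", key)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_variants(forms):
    """Group forms that share a matching key, keeping first-seen order."""
    groups = defaultdict(list)
    for form in forms:
        groups[match_key(form)].append(form)
    return dict(groups)


# Historical forms come from the corpus examples; modern forms are our additions.
forms = ["Çamora", "Samora", "villa", "vila", "fes", "fez", "Rozario", "Rosário"]
groups = group_variants(forms)
```

With this key, each historical form in the sample list falls into the same group as its assumed modern counterpart. Rules this coarse will also merge unrelated words, so such a key would only be suitable for proposing candidates to a human annotator, not for deciding matches on its own.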
Variation in the Parish Memories depends not only on linguistic phenomena such as spelling but also on the ability of the priest and, possibly, on the interpretation of the transcriber, as we are dealing with texts transcribed from handwritten sources.

The manual annotation served two purposes: to evaluate available named entity recognition systems applied to this kind of text (Vieira et al., 2021), and to prepare for a next phase in which a larger portion of the corpus will be annotated in order to adapt current systems and improve their performance. Preparing for the next phase includes creating new annotation guidelines, and to advance towards this goal we reflected on the problems and challenges discussed below.

3 Problems and challenges of an 18th-century corpus

Working with 18th-century sources is quite different from dealing with present-day written registers. In this paper, we describe some of the problems and challenges encountered during the annotation. We organised them into a few categories.

Morphology and Spelling: Variation is a challenge for the automatic processing of 18th-century Portuguese texts, as most existing tools operate on contemporary Portuguese (Cameron et al., 2020). In the 18th century, variation could arise from linguistic issues and from spelling variation. The use of uppercase and lowercase letters was also inconsistent: they were used randomly in the original, and some transcribers kept the randomness, while others interpreted it and updated it to current Portuguese usage. Capital letters are usually relevant for named entity identification, but their presence may be less consistent in this kind of corpus. Punctuation marks do not follow the current pattern precisely either. The variation found at this stage of the language is a constraint not only for automatic processing but also for human readers, even language experts, and even in minor matters.
For example, regarding the allographs s/ç in word-initial position, as in Çamora/Samora [Samora Correia, name of a parish], the form used is Çamora, and it is placed in the alphabetical sequence beginning with C included in the last volume of the Memórias Paroquiais, which may make the document hard to find.

Observations about each category:

PER: 18th-century Portuguese society was highly hierarchical, and people were defined by their social category. For that reason, the so-called pre-nouns are very important. Sometimes they are mixed with forms of treatment or protocol forms. They can help identify the social category or the occupation, as in the following example.

1. Exmo. e Ilustríssimo Senhor D. Manuel da Gama, Vice-Rei da Índia

In the annotation process, if we annotate the person's name without the pre-nouns, we lose essential information about that person.

ORG: The distinction between physical localities and political entities is often subtle in these texts, where the same name can be a Location or an Organisation (Álvarez-Mellado et al., 2021), as in:

2. Esta Villa de Amieyra fica na Provincia do Alemtejo pertence ao Gram Priorado do Crato, de que hê cabeça a Villa do Crato.

LOC: The LOC entity refers to specific places. However, sometimes the parish texts do not specify the place and indicate only a general location, as in:

3. Esta Vila está a meio da encosta

“meio da encosta” [halfway up the slope] is not a specific location, so it would not be a LOC entity.

The following issues are not particular to 18th-century texts but are general problems for NER that may affect the annotation process.

Discontinuity: The annotation process usually treats a named entity as a continuous sequence. However, in texts, names may appear in a reduced form or split across the discourse.
In the sequence below, “Irmandade de Nossa Senhora do Rozario” is registered in the text as “Nossa Senhora do Rozario”, and the human reader understands that this name is related to “Irmandades”. The entity itself should be named Irmandade de Nossa Senhora do Rozario, whereas, in the text, “irmandade” is written in the plural, as it introduces an enumeration of organisations of this type.

4. As Irmandades que existem, dentro della [Vila], são sinco, a saber, a do Sanctissimo Sacramento a que esta anexa a das Quarentas Horas; a de Nossa Senhora do Rozario; a de Nossa Senhora da Graça, a das Almas, a do Appostolo Sam Pedro advincula, administrada pelo clero desta Villa.

Special attention must be given to the surrounding discourse context in order to correctly treat Nossa Senhora do Rozario, Appostolo or Sam Pedro advincula as Organisations and not Persons.

Embedded entities: The great majority of work in NER does not consider embedded entities. One must choose between annotating the constituents or the longer sequence. In the following examples, we can see a mixture of entities. Whereas the general assumption is to annotate the complete mention, in (5) we can identify both the college name and the university name. In (6), we have a person’s name embedded in the organisation’s name.

5. Collegio de Sao Paullo da Universidade de Coimbra
6. Companhia da Infante Dona Brites Pereira Saboya

4 Next steps

Although the basic categories, Person, Location and Organisation, are relevant, significant improvements are needed to better capture or reproduce the society of the time and its structural differentiation. Inspired by the experience of prosopographical datasets regarding persons, we consider adopting the subcategories name, occupation, and social category.
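One way to picture these subcategories is as fields attached to a PER mention. The sketch below is only an illustration: the class `PersonMention`, its field names, and the way example (1) is split into name, occupation and social markers are our assumptions for the purpose of the example, not the guidelines under preparation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class PersonMention:
    """One PER mention split into hypothetical subcategory fields."""
    surface: str                       # full mention as it appears in the text
    name: str                          # proper name without pre-nouns
    occupation: Optional[str] = None   # office or occupation, if stated
    social_markers: List[str] = field(default_factory=list)  # pre-nouns, forms of treatment

    def offsets(self) -> Dict[str, Tuple[int, int]]:
        """Return (start, end) character offsets of each part inside `surface`."""
        parts = [self.name] + self.social_markers
        if self.occupation:
            parts.append(self.occupation)
        spans = {}
        for part in parts:
            start = self.surface.find(part)
            if start == -1:
                raise ValueError(f"{part!r} not found in mention")
            spans[part] = (start, start + len(part))
        return spans


# Example (1) from Section 3; the split into fields is an assumed reading.
mention = PersonMention(
    surface="Exmo. e Ilustríssimo Senhor D. Manuel da Gama, Vice-Rei da Índia",
    name="Manuel da Gama",
    occupation="Vice-Rei da Índia",
    social_markers=["Exmo. e Ilustríssimo Senhor", "D."],
)
spans = mention.offsets()
```

Keeping the full surface form alongside the parts means the pre-nouns are not lost when the bare name is extracted, and the offsets allow each part to be mapped back to the transcription.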
A proper name may refer to a person, but in some cases references are made through, or combined with, distinct occupations or social statuses, which constitute essential information in historical analysis, as briefly mentioned above. How to deal with the relations between persons (family, patronage and others) is an issue to be developed in the near future.

Location and organisation are challenging to distinguish when geopolitical entities are involved, such as names of countries, states and cities. Therefore, in some previous work, we adopted the GPE (geopolitical entity) category to avoid this ambiguity. Besides GPE, we may need other Location and Organisation subcategories, such as rivers and mountains for places and churches for organisations. Time is another crucial tag that is missing.

To better describe the work to be done and to prevent ambiguities when several annotators are involved, we are now working on new guidelines adjusted to 18th-century Portuguese. Annotating is a way of enriching a corpus, and this effort must handle the complexity of the past and not misrepresent it with anachronistic simplifications. It is essential to look at a word in the economy of the text and to look at the text in its global context. In this sense, the presence of a social historian is important in a team dealing with corpora from the distant past, whether they are composed of literary or administrative texts. Once we have the new annotated corpus, we will retrain existing NER models to annotate this kind of corpus automatically for historical research.

References

1. Álvarez-Mellado, E., Díez-Platas, M. L., Ruiz-Fabo, P., Bermúdez, H., Ros, S., González-Blanco, E. (2021). TEI-friendly annotation scheme for medieval named entities: a case on a Spanish medieval corpus. Language Resources and Evaluation, 55(2), 525–549.
2. Cardeira, E. (2006). O essencial sobre a História do Português. Alfragide: Editorial Caminho.
3. Cardeira, E., Mateus, M. H. (2008). Norma e Variação. Alfragide: Editorial Caminho.
4. Cameron, H. F., Gonçalves, M. F., & Quaresma, P. (2020). Linguistic and orthographical classic Portuguese variants challenges for NLP. Proceedings of the 14th International Conference on the Computational Processing of Portuguese (pp. 43–48).
5. Costa, A. C. (1712). Corografia Portugueza e descripçam topografica do famoso reyno de Portugal: com as noticias das fundações das cidades, villas, & lugares, que contem, varões illustres, genealogias das familias nobres, fundações de conventos, catalogos dos bispos, antiguidades, maravilhas de natureza, edificios & outras curiosas observaçoens. Lisboa: na officina de Valentim da Costa Deslandes, Tomo primeyro [-terceyro], vol. 1-2-3.
6. Klie, J.-C., Bugert, M., Boullosa, B., de Castilho, R. E., and Gurevych, I. (2018). The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, pages 5–9. Association for Computational Linguistics.
7. Marquilhas, R. (2000). A faculdade das letras: leitura e escrita em Portugal no séc. XVII. Lisboa: IN-CM.
8. McHugh, M. L. (2012). Interrater reliability: the kappa statistic. Biochemia Medica, 22(3), 276–282.
9. Santos, D. & Cardoso, N. (Eds.) (2007). Reconhecimento de entidades mencionadas em português: Documentação e actas do HAREM, a primeira avaliação conjunta na área. Linguateca: digital print.
10. Vieira, R., Olival, F., Cameron, H., Santos, J., Sequeira, O., Santos, I. (2021). Enriching the 1758 Portuguese Parish Memories (Alentejo) with named entities. Journal of Open Humanities Data 7, 20.