Extracting Entities and Events from Archives Textual Metadata*

Davide Varagnolo1, Cássio Rodrigues1, Ana Martins1, Dora Melo2,3 [0000-0003-3744-2980], and Irene Pimenta Rodrigues1,3 [0000-0003-2370-3019]

1 Department of Informatics, University of Évora, Portugal
2 Coimbra Business School—ISCAC, Polytechnic Institute of Coimbra, Portugal
3 NOVA Laboratory for Computer Science and Informatics, NOVA LINCS, Portugal
d.varagnolo@studenti.unipi.it, cassiorodrigues@outlook.com, anacatarinasmartins@hotmail.com, dmelo@iscac.pt, ipr@uevora.pt

* This work is financed by National Funds through FCT - Foundation for Science and Technology I.P., within the scope of the EPISA project - DSAIPA/DS/0023/2018. Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. A method for extracting events and entities from ISAD(G) archival metadata elements that contain text descriptions is presented. The method consists of applying a set of processing rules according to the classification of text fields, with the objective of populating the CIDOC-CRM ontology with additional information about events and entities, such as persons, locations, dates, relations, baptisms, and births. An illustrative example of the classification, extraction, and representation of a 'baptism' is also presented.

Keywords: Text Classification · Natural Language Processing · Semantic Web · Archives Linked Data Semantic Representation.

1 Introduction

EPISA (Entity and Property Inference for Semantic Archives) is a research project involving the Portuguese National Archives - Torre do Tombo, archival experts, and Information and Computer Science researchers. The project aims to design an open-source knowledge platform prototype that represents archival information in a linked data model. One of the project's major tasks is the semantic migration, i.e., the process of extracting and representing the relevant entities and their properties from the existing records in the current DigitArq platform [13]. DigitArq is the Portuguese National Archives' system; it uses well-established description standards, namely ISAD(G) (General International Standard Archival Description) [3] and ISAAR(CPF) (International Standard Archival Authority Record for Corporate Bodies, Persons and Families) [14], with a hierarchical structure adapted to the nature of archival assets.

To accomplish the migration, an automatic semantic migration prototype based on Knowledge Discovery was developed, which populates a CIDOC-CRM ontology from the digital archive metadata. The CIDOC-CRM (Conceptual Reference Model) standard, an ontology developed for museums by the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM) [10, 2], was used to build the data model and description vocabularies [4, 11].

The resulting dataset is an OWL knowledge base representing the existing information in DigitArq as is, where each DigitArq representation of a metadata archival unit has a scheme complying with the ISAD(G) [3] and ISAAR [14] recommendations. The DigitArq information is organized according to a set of fields and their values. Among these fields, some hold atomic values, such as "Reference code", "Title", or "Recipient", that do not require further interpretation; their migration was done by applying a predefined set of rules establishing the mapping between ISAD(G) elements and CIDOC-CRM classes and properties.
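For illustration only, the following sketch shows what such a rule-driven mapping of atomic fields could look like; the chosen properties, the namespace, and the record values are assumptions made for the example, not the project's actual mapping rules (these are discussed in [4, 11]).

# Sketch of a rule-driven mapping from atomic ISAD(G) elements to
# CIDOC-CRM properties; the chosen properties and the namespace are
# illustrative assumptions, not the EPISA mapping rules.
from rdflib import Graph, Literal, Namespace

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
ARQ = Namespace("http://example.org/archive/")  # hypothetical base namespace

# Hypothetical mapping table: ISAD(G) element -> CIDOC-CRM property.
MAPPING = {
    "Reference code": CRM["P1_is_identified_by"],
    "Title": CRM["P102_has_title"],
}

def migrate_atomic_fields(graph, doc_uri, record):
    """Apply the predefined mapping rules to a record's atomic fields."""
    for element, value in record.items():
        prop = MAPPING.get(element)
        if prop is not None:
            graph.add((doc_uri, prop, Literal(value)))

g = Graph()
doc = ARQ["doc-0001"]  # invented document URI for the example
migrate_atomic_fields(g, doc, {"Reference code": "PT/TT/0001", "Title": "Baptismos"})
print(g.serialize(format="turtle"))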
There are other fields, such as 'Scope and content' and 'Archival and Custodial History', that carry additional information describing their unit in text format. These texts usually have a structure that can be recognized using Natural Language Processing (NLP) tools, which output a feature-value list that is, in turn, the input for the additional ontology population.

The information extraction from text is not intended to extract all the information, but only the parts considered important, such as baptisms, births, inventories due to death, incorporations of documents between archives, and the institutions, persons, and places involved in those events. Methodologies to extract general information from text into ontologies are presented in several works, such as [8, 1, 9, 12, 7, 5, 6]. In particular, OntoPrima [8] is an NLP-based ontology population system that extracts instances of concepts and relations from text.

The goal of this paper is to present the process of applying a set of processing rules according to the classification of text fields, with the objective of populating the ontology with additional information about events and entities, such as persons, locations, dates, relations, baptisms, and births.

2 Events and Entities extraction overview

The extraction of information from the text elements of documents depends on the type of information, such as baptisms, births, inventories due to death, incorporations of documents in archives, and the institutions, persons, dates, and places involved in those events. The ontology representation of this information and the corresponding mapping rules are defined manually and are presented in the next subsection. Each type of information and extraction process is defined using GATE (https://gate.ac.uk/), and an automatic text classifier is used to decide the type of information in a text linked to a document.

Fig. 1. Architecture of the Events and Entities Ontology Population.

Figure 1 presents the architecture of the proposed system, which has three phases. The first phase runs a classifier over the OWL representation of the national archives information. In the second phase, for each classified text linked to a document (E31 Document), the text and the corresponding document reference code are sent to an information extraction process. Finally, the information extraction process extracts a set of relations that, through the OWL rewriting rules (mapping rules), represent the extracted information in CIDOC-CRM, linked to the document from which the information was extracted.
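As an illustration of this control flow only, the following sketch summarizes the three phases; the function and method names are hypothetical stand-ins for components that are described but not reproduced in this paper.

# Hypothetical sketch of the three-phase architecture; classify_text,
# run_gate_extraction, apply_rewriting_rules and document_texts stand in
# for components the paper describes but does not publish.
def populate_ontology(knowledge_base, classify_text, run_gate_extraction,
                      apply_rewriting_rules):
    # Phase 1: classify every text element linked to an E31 Document.
    for doc_ref, text in knowledge_base.document_texts():
        category = classify_text(text)  # e.g. "baptism", or None
        if category is None:
            continue  # the text conveys none of the target information types
        # Phase 2: send the text and document reference code to the extractor.
        relations = run_gate_extraction(category, doc_ref, text)
        # Phase 3: rewrite the extracted relations into CIDOC-CRM, linked to
        # the document from which the information was extracted.
        apply_rewriting_rules(knowledge_base, doc_ref, relations)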
Fig. 2. CIDOC-CRM Representation of a Baptism Activity.

2.1 Representation of events and entities extracted

Each classified text has a set of mapping rules assigned to it, which allow the extracted information to be represented in CIDOC-CRM. For better understanding, consider in particular the 'baptism' type of information. The representation of a baptism in CIDOC-CRM is presented in Figure 2. A 'baptism' is represented as a CIDOC-CRM activity (E7 Activity) linked to the document representation by the property 'P129 is about'. The 'E7 Activity' has type 'baptism' and can have a date, the person baptized (who is the recipient of the document), and the persons and their roles in the baptism context, such as the godmother and the godfather. A baptism description also describes the birth, which is represented by the CIDOC-CRM class 'E67 Birth', linked to the document by the property 'P67 refers to'. The birth class has properties to represent the parents, 'P96 by mother' and 'P97 from father'. To represent the grandparents, new birth events are used to represent the births of the parents. All these births are linked to the document containing the text description by the property 'P67 refers to'. This representation is automatically generated from the information extracted from the text and the document reference code.

2.2 Text metadata automatic classification

To build an automatic classifier that determines whether a natural language text contains some type of information, such as the description of a baptism, a marriage, a passport request, a material transference between archives, or an enumeration of entities, a sample of fonds with their hierarchically dependent documents was chosen and semi-manually tagged. The classified texts were then used to build the datasets for the automatic classifiers.

Corpus for the Classifications A set of 3,800 Portuguese National Archives documents was selected from four district archives, and for each document the information was represented as shown in Table 1, when adequate. The text values of the ISAD(G) elements from the sample, such as "Scope and content" or "Archival and Custodial History", are used to define the automatic text classifier that decides whether a text, defined as a string and linked to a document in the archives' OWL2 representation, should be used to extract events and entities to populate the ontology.

Classifiers performance Since the text categories present in the selected fonds were unbalanced, from the 3,800 manually classified texts a sample of 300 texts for each category was selected. Then, Decision Trees were used to build a classifier for each information type. The dataset for each Decision Tree was obtained by selecting 300 positive text occurrences and 300 negative occurrences, the latter drawn from the positive occurrences of the other text categories. Each classifier was built by dividing the dataset into two subsets, one for training (70%) and another for testing (30%). The classifiers achieved 100% accuracy.
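The following sketch illustrates how one such per-category classifier can be built with scikit-learn; the bag-of-words feature representation is an assumption made for the example, since the features used are not detailed here.

# Sketch of one per-category Decision Tree classifier; the bag-of-words
# features are an assumption, as the feature representation is not
# specified in the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def build_category_classifier(positive_texts, negative_texts):
    """Train a binary classifier for one information type (e.g. 'baptism').

    positive_texts: the 300 texts of the target category;
    negative_texts: the 300 texts drawn from the other categories.
    """
    texts = positive_texts + negative_texts
    labels = [1] * len(positive_texts) + [0] * len(negative_texts)
    vectorizer = CountVectorizer()
    features = vectorizer.fit_transform(texts)
    # 70% of the dataset for training, 30% for testing, as described above.
    x_train, x_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    classifier = DecisionTreeClassifier().fit(x_train, y_train)
    print("accuracy:", accuracy_score(y_test, classifier.predict(x_test)))
    return vectorizer, classifier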
Table 1. Examples of classified texts.

Classification | ISAD(G) Element | Text
Incorporation | Archival and Custodial History | "Livros entrados no Arquivo por transferência do Arquivo da Universidade de Coimbra, onde se encontravam provisoriamente em 1976, e por incorporações do Cartório Notarial de Oliveira de Azeméis."
Incorporation | Archival and Custodial History | "Terá sido transferida da Direcção de Finanças de Beja e/ou da Tesouraria da Fazenda Pública para o ADBeja em 1988, ao abrigo do DL n.º 46350, de 22 de Maio de 1965."
Incorporation | Archival and Custodial History | "Proveniente da Direcção Escolar de Beja, em 2001."
Baptism | Scope and Content | "Pais: Manuel Martins Ramos e Ana Joaquina Martins Avos maternos: João Martins de Oliveira e Joana Francisca da Silva Avós paternos: Manuel Martins Ramos e Custódia Maria da Costa Padrinhos: António Martins de Oliveira e Josefina Maria de Jesus Data de nascimento: 9 de Novembro de 1811"

2.3 Events and entities extraction

The information extraction process is defined using GATE (General Architecture for Text Engineering), a Java suite of tools for performing natural language processing tasks over corpora. The tasks are managed by applications that include several language resources. The main application is ANNIE (A Nearly-New Information Extraction System), a set of modules comprising a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named entity transducer, and a coreference tagger.

Although GATE does not natively support the Portuguese language, the test results obtained in the information extraction tasks are very satisfactory and promising. Each information extraction task is implemented by defining new rules for two modules:

– The Gazetteer (originally, a geographical dictionary or directory used in conjunction with a map or atlas) consists of entity dictionaries used in the Named Entity Recognition task. In the ANNIE application, these entity dictionaries help to tag different words in the texts.
– The Named Entity Transducer performs named entity extraction via a finite state transducer, managed by rules. In the ANNIE application, it is possible to create and modify rules using Jape (Java Annotation Pattern Engine, https://gate.ac.uk/wiki/jape-repository/). A Jape grammar consists of a set of phases, each of which consists of a set of pattern/action rules. The phases run sequentially and constitute a cascade of finite state transducers over annotations. The left-hand side (LHS) of a rule consists of an annotation pattern description. The right-hand side (RHS) consists of annotation manipulation statements. Annotations matched on the LHS of a rule may be referred to on the RHS by means of labels attached to the pattern elements.

Consider, as an illustrative example, the text presented in Table 1 that is classified as a 'baptism'. This text has a predefined structure:

– All the relatives follow the structure
  {DegreeOfKinship}: {MaleRelative} e {FemaleRelative}
– The date of birth has a similar structure, with two possible variants:
  Data de nascimento: {DayInNumbers} de {NameOfMonth} de {YearInNumbers}
  Data de nascimento: {DayInNumbers}-{MonthInNumbers}-{YearInNumbers}

This kind of structured text is adequate for defining rules that help to extract entities. To achieve this purpose, two lists of names were created in the Gazetteer to automatically tag the months and the nouns designating degrees of kinship in Portuguese (months_pt.lst and relatives_pt.lst).
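For illustration, a gazetteer lookup of this kind can be sketched as follows; this is not GATE's implementation, and the list contents shown are abbreviated assumptions.

# Sketch of gazetteer-style tagging: tokens found in an entity dictionary
# receive that dictionary's tag. Not GATE's implementation; the list
# contents are abbreviated assumptions.
def load_gazetteer(path):
    """Read one entry per line from a .lst file into a lookup set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}

def tag_tokens(tokens, gazetteers):
    """Return (token, tag) pairs; the tag is None for untagged tokens."""
    tagged = []
    for token in tokens:
        tag = next((name for name, entries in gazetteers.items()
                    if token.lower() in entries), None)
        tagged.append((token, tag))
    return tagged

gazetteers = {
    "Month": {"janeiro", "novembro"},          # abbreviated months_pt.lst
    "Kinship": {"pais", "avos", "padrinhos"},  # abbreviated relatives_pt.lst
}
print(tag_tokens("Data de nascimento : 9 de Novembro de 1811".split(), gazetteers))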
Portuguese proper names are also extracted, but there is no need to apply named entity recognition techniques, because the pattern of this kind of text allows the extraction to be carried out directly with the defined Jape rules. After the Gazetteer processing, all the degrees of kinship and the months that appear in the texts are tagged. Then, after the Sentence Splitter and the POS Tagger have run, the Jape rules are applied to extract the names of the relatives and the birthdates. Concerning the other text classifications, specialized rules are defined to extract the information (in order to cover different patterns), and a priority system determines which rules are triggered. The following Jape rule extracts the parents in the baptism classification:

Rule: Pais
Priority: 100
(
  {Token.string ==~ "[Pp]ais"}
  {Token.string == ":"}
):intro
(
  {Person.kind == fullName}
):pai
(
  {Token.string == "e"} |
  ({Token.string == "e"} {Token.string == "de"})
):and
(
  {Person.kind == fullName, Token.string != "Avos", Token.string != "Data"}
):mae
-->
:pai.Relative = {kind = "pai"},
:mae.Relative = {kind = "mae"}

Finally, the output of each extraction task is an XML file with all the extracted information. This file is the input of the 'Rewriting Rules' module, which updates the knowledge base with the new individuals, class instances, and the properties that link them together and represent the extracted information in CIDOC-CRM. These new individuals are also linked to the document where they were mentioned.
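As a sketch of what the 'Rewriting Rules' step produces for a baptism, assuming a hypothetical XML layout and URI scheme (neither is specified in this paper), the extracted relations could be rewritten into the CIDOC-CRM representation of Section 2.1 as follows.

# Sketch of the 'Rewriting Rules' step for a baptism; the XML layout and
# the URI scheme are hypothetical. The CIDOC-CRM classes and properties
# are those named in Section 2.1.
import xml.etree.ElementTree as ET
from rdflib import RDF, Graph, Namespace

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
ARQ = Namespace("http://example.org/archive/")  # hypothetical base namespace

def rewrite_baptism(graph, doc_ref, extraction_xml):
    """Rewrite one baptism extraction (hypothetical XML layout) into triples."""
    doc = ARQ[doc_ref]
    baptism = ARQ[doc_ref + "/baptism"]  # invented URI scheme
    birth = ARQ[doc_ref + "/birth"]
    graph.add((baptism, RDF.type, CRM["E7_Activity"]))
    graph.add((doc, CRM["P129_is_about"], baptism))
    graph.add((birth, RDF.type, CRM["E67_Birth"]))
    graph.add((doc, CRM["P67_refers_to"], birth))
    for relative in ET.fromstring(extraction_xml).iter("relative"):
        person = ARQ[relative.text.replace(" ", "_")]
        graph.add((person, RDF.type, CRM["E21_Person"]))
        if relative.get("kind") == "pai":
            graph.add((birth, CRM["P97_from_father"], person))
        elif relative.get("kind") == "mae":
            graph.add((birth, CRM["P96_by_mother"], person))

g = Graph()
rewrite_baptism(g, "doc-0001",
    '<baptism><relative kind="pai">Manuel Martins Ramos</relative>'
    '<relative kind="mae">Ana Joaquina Martins</relative></baptism>')
print(g.serialize(format="turtle"))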
Information Extraction Performance These specialized information extraction processes are still under development. However, a preliminary evaluation shows that the process for baptism descriptions can achieve 98% precision and 99% recall when extracting dates, entities, birth events, and baptism activities, with the correct roles assigned to the persons who participated in the events, as well as the places. Similar precision and recall results were obtained for the processes that extract information from very well-structured texts, but the experiments with texts classified as incorporation show much lower precision and recall, because those texts are not as well structured. These kinds of texts require the use of syntactic information and semantic similarity to recognize the events and the roles in the events.

The experiments to evaluate the precision and recall of the extraction processes are done by choosing a set of documents where an ISAD(G) element is classified as the process category; for each text, a human decides whether the extracted information is correct and whether all the information was extracted. In the near future, this evaluation will be improved by automating some parts of the human verification.

3 Conclusions and Future work

A method for extracting information from ISAD(G) elements that contain text descriptions was proposed in this paper. The method has five phases: the definition of the information to be represented, the definition of the rules to map the information into CIDOC-CRM, a classifier for selecting the texts that convey the information, a process for automatically extracting the information, and its representation in CIDOC-CRM using the mapping rules.

The experiments in extracting events and entities from text documents, for some types of information, show that the proposed strategy is adequate, even if some processes, such as the extraction of incorporation information, need to be improved. The classifier evaluation shows that it is easy to build a classifier with very good performance for classifying the texts in the text elements. The task of manually annotating the set of documents for the classifier is laborious and time consuming, since it has to be done for a sample of at least 500 positive text occurrences for each information type. The evaluation of the final results is done by assessing the correctness of the representation of the events and entities in the final ontology, which is also laborious and time consuming and requires specialized knowledge of the ontology representation. As future work, it is intended to improve the evaluation process.

References

1. di Buono, M.P., Monteleone, M., Elia, A.: How to populate ontologies. In: Métais, E., Roche, M., Teisseire, M. (eds.) Natural Language Processing and Information Systems. pp. 55–58. Springer International Publishing, Cham (2014)
2. ICOM/CIDOC: Definition of the CIDOC Conceptual Reference Model. ICOM/CRM Special Interest Group, 7.0.1 edn. (October 2020)
3. International Council on Archives: ISAD(G): General International Standard Archival Description, Second Edition. Springer Nature BV (2011)
4. Koch, I., Freitas, N., Ribeiro, C., Lopes, C.T., da Silva, J.R.: Knowledge graph implementation of archival descriptions through CIDOC-CRM. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds.) Digital Libraries for Open Knowledge. pp. 99–106. Springer International Publishing, Cham (2019)
5. Kordjamshidi, P., Moens, M.F.: Global machine learning for spatial ontology population. Journal of Web Semantics 30, 3–21 (2015)
6. Leshcheva, I., Begler, A.: A method of semi-automated ontology population from multiple semi-structured data sources. Journal of Information Science 0(0) (2020)
7. Lubani, M., Noah, S.A.M., Mahmud, R.: Ontology population: Approaches and design aspects. Journal of Information Science 45(4), 502–515 (2019)
8. Makki, J.: OntoPrima: A prototype for automating ontology population. International Journal of Web/Semantic Technology (IJWesT) 8 (2017)
9. Maynard, D., Li, Y., Peters, W.: NLP techniques for term extraction and ontology population (2008)
10. Meghini, C., Doerr, M.: A first-order logic expression of the CIDOC Conceptual Reference Model. International Journal of Metadata, Semantics and Ontologies 13(2), 131–149 (2018)
11. Melo, D., Rodrigues, I.P., Koch, I.: Knowledge discovery from ISAD, digital archive data, into ArchOnto, a CIDOC-CRM based linked model. In: Proceedings of the 12th International Joint Conference on Knowledge Discovery, KEOD - Volume 2. pp. 197–204. INSTICC, SciTePress (2020)
12. Petasis, G., Karkaletsis, V., Paliouras, G., Krithara, A., Zavitsanos, E.: Knowledge-Driven Multimedia Information Extraction and Ontology Evolution: Bridging the Semantic Gap. Springer Berlin Heidelberg, Berlin, Heidelberg (2011)
13. Ramalho, J.C., Ferreira, J.C.: DigitArq: creating and managing a digital archive. In: Building Digital Bridges: Linking Cultures, Commerce and Science: 8th ICCC/IFIP International Conference on Electronic Publishing held in Brasília - ELPUB 2004, Brazil, June 2004 (2004)
14. Vitali, S.: Authority control of creators and the second edition of ISAAR(CPF), International Standard Archival Authority Record for Corporate Bodies, Persons, and Families. Cataloging & Classification Quarterly 38(3-4), 185–199 (2004)