<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>TPDL</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Support Archival Description (short paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mariana Dias</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carla Teixeira Lopes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Cultural Heritage, Information Extraction, Optical Character Recognition</institution>
          ,
          <addr-line>Ontology Population, Semantic</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Engineering of the University of Porto and INESC-TEC</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>26</volume>
      <fpage>20</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and promote findability. The detail required in manual descriptions of cultural heritage objects makes them taxing and time-consuming to produce. Given this, in EPISA, a research project on this topic, we propose to use the contents of the digital representations associated with the objects to assist archivists in their description tasks. More specifically, we extract information from the digital representations that is useful for an initial ontology population, which should then be validated or edited by the archivist. We apply optical character recognition in an initial stage to convert the digital representation to a machine-readable format. We then identify ontology concepts using neural networks and contextual embeddings and instantiate them using ontology-oriented programming.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Cultural heritage institutions have the mission of protecting, valuing, and sharing national
inheritances. Linked Data has provided the possibility to improve the quality of archival
descriptions and promote access to enriched cultural archives by providing users with a more
in-depth knowledge of collections. However, manually describing cultural heritage objects can
be taxing and time-consuming, making it challenging to describe documents or collections in
finer detail. The automatic extraction of information from digital representations relevant to
the archival description can ease the work of cultural heritage professionals.</p>
      <p>
        EPISA (Entity and Property Inference for Semantic Archives) is a research project that explores
the use of Linked Data in the context of the Portuguese National Archives. ArchOnto [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a
linked data model for archives, was proposed in this project. This paper presents an overview
of what is being done in an EPISA task that aims to extract concepts and relations from digital
representations of archival records and map them to ArchOnto. This task’s final goal is to
provide these mappings as suggestions in the user interface to speed up future descriptions.
It is not the purpose of this paper to go into detail about each stage of the process, including
their evaluation, which we will describe in other articles. Our goal with this paper is to report
the ongoing work supporting linked data archival description in a research project, share our
experience, and obtain feedback from others participating in the workshop.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Recent research in the cultural heritage field has used semantic representations of
cultural heritage objects to improve access to resources, together with information extraction
techniques to capture the object-related information that sustains the knowledge bases.
Several works present projects that extract information from non-machine-readable historical
documents into ontologies [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. Witte et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Vlachidis
and Tudhope [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] developed automatic ontology-based information extraction approaches based
on linguistic NLP techniques for a 19th-century encyclopedia of compiled architectural
knowledge and archaeological grey-literature reports, respectively. Packer and Embley [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] created
an automatic tool, ListReader, to extract information from lists in OCRed documents using a
wrapper induction technique that populates a user-defined ontology. Goy et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] presented
a proof-of-concept prototype [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of a crowdsourcing platform using non-machine-readable
documents from the Istituto Gramsci dating from 1968 to 1969. Experts participate in the
semantic annotation process guided by the ontology and supported by suggestions provided by
automatic Information Extraction techniques.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>We propose an architecture divided into three modules: Optical Character Recognition (OCR),
Information Extraction (IE), and Ontology Population (OP). Figure 1 describes the execution
pipeline and shows how the project task we are working on, Document mining for automatic
metadata records, relates to the task Exploration and querying interface. When a
non-machine-readable archival digital representation is uploaded to the user interface, the
OCR module extracts its textual content after an image pre-processing phase. The IE module
processes the textual content and uses a trained named entity recognition (NER) model to
predict and annotate concepts. The concepts are then mapped to the ArchOnto ontology, creating
candidate concepts and relations that are suggested to the archivist as description values.
We detail each of these modules in the following subsections.</p>
      <sec id="sec-3-1">
        <title>3.1. Optical Character Recognition</title>
        <p>
          The success of text recognition depends on the quality of digital representations, and it is
common for heritage documents to suffer some degree of degradation over time. Image processing
methods can be applied before the text recognition phase to mitigate problems ranging from uneven
illumination to erased characters and angled digital representations, improving the image quality
and the overall text extraction. We conducted an optimization experiment over different image
processing algorithms and their parametrization using the OpenCV [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] library and a non-dominated sorting genetic algorithm
(NSGA-II) to determine the impact of image processing algorithms on the OCR performance.
Converting digital representations to a machine-readable format is executed with Tesseract [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
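        <p>As a rough illustration of the selection principle behind NSGA-II, the sketch below keeps only the non-dominated candidates among a handful of pre-processing configurations. It is not the code used in the experiment: the configuration names, the two objectives (a character error rate and a processing time, both minimized), and all values are hypothetical. The full algorithm additionally relies on crowding-distance sorting, crossover, and mutation, which are omitted here.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

Objectives = Tuple[float, float]  # (error rate, seconds); both are minimized


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    better = any(x < y for x, y in zip(a, b))
    return no_worse and better


def pareto_front(candidates: Dict[str, Objectives]) -> List[str]:
    """Return the names of candidates not dominated by any other candidate."""
    front = []
    for name, score in candidates.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical pre-processing configurations with made-up scores.
configs = {
    "no_preprocessing": (0.20, 1.0),
    "binarize": (0.12, 1.4),
    "binarize_deskew": (0.08, 2.1),
    "denoise_only": (0.15, 1.6),
}
front = pareto_front(configs)
```
]]></preformat>
        <p>With these made-up values, only the denoise_only configuration is discarded, because binarize is better on both objectives. The remaining three configurations represent different trade-offs between accuracy and cost.</p>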
        <p>To better illustrate the process, we show an excerpt of the OCR output of a typewritten letter
in Figure 2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Information Extraction</title>
        <p>This module first pre-processes the textual content that the OCR module extracted from the
digital representations. Pre-processing consists of removing characters and punctuation introduced
by OCR that are not part of the original works, removing multiple empty lines, and rejoining
hyphenated words.</p>
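        <p>The clean-up steps above could be sketched with regular expressions as follows. This is a minimal illustration under our own assumptions, not the project's implementation: the function name, the whitelist of allowed characters, and the choice to collapse runs of empty lines into a single empty line are all hypothetical.</p>
        <preformat><![CDATA[
```python
import re


def clean_ocr_text(text: str) -> str:
    """Apply the clean-up steps to raw OCR output (illustrative rules only)."""
    # Rejoin words hyphenated across a line break: "Ju-\nnho" -> "Junho".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop characters outside an assumed whitelist of letters, digits,
    # whitespace, and common punctuation.
    text = re.sub(r"[^\w\s.,;:!?()'/-]", "", text)
    # Remove trailing spaces and tabs left at the end of each line.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of empty lines into a single empty line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "Lisboa, 27 de Ju-\nnho de 1961 ~|\n\n\n\nSECRETÁRIO §"
cleaned = clean_ocr_text(raw)
```
]]></preformat>
        <p>A rule this simple also merges genuinely hyphenated compounds that happen to break at the end of a line, so a dictionary check may be needed in practice.</p>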
        <p>
          We trained a NER model using a Bidirectional Long Short-Term Memory Network with
a Conditional Random Field Layer (BiLSTM-CRF) Neural Network model and pre-trained
contextual string embeddings by Santos et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. An annotated Portuguese dataset is necessary
for model training with the Sequence Tagger. However, there are no available annotated archival
collections in Portuguese and manually creating a large annotated corpus is time-consuming.
As an alternative, an available Portuguese corpus for named entities, the Second HAREM
collection [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], was pre-processed to suit the ArchOnto ontology’s classes. The collection
includes 1,040 documents with 7,847 named entities. Annotations that are not relevant for the
ontology population task were stripped from the annotated corpus, and suitable annotations
were mapped to ArchOnto classes. For instance, textual elements tagged with labels relating to
the concept of a person were mapped to a CIDOC-CRM person (class E21 Person), and elements
tagged with labels relating to the concept of the role of a person in an event were mapped to
the ArchOnto role (class ARE8 Role Type).
        </p>
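        <p>The corpus transformation described above can be pictured as a filter followed by a renaming step. The sketch below assumes annotations are available as (text, label) pairs. The source label names are hypothetical placeholders rather than the actual labels of the corpus, and the E53 and E52 targets are taken from the identifiers that appear in Listing 1.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

# Hypothetical source labels mapped to target class identifiers.
# E21 and ARE8 are named in the text; E53 and E52 appear in Listing 1.
LABEL_TO_CLASS = {
    "PERSON": "E21",
    "ROLE": "ARE8",
    "PLACE": "E53",
    "TIME": "E52",
}

Annotation = Tuple[str, str]  # (annotated text, label)


def remap(annotations: List[Annotation]) -> List[Annotation]:
    """Drop annotations without a mapping and rename the remaining labels."""
    return [
        (text, LABEL_TO_CLASS[label])
        for text, label in annotations
        if label in LABEL_TO_CLASS
    ]


sample = [("Lisboa", "PLACE"), ("SECRETÁRIO", "ROLE"), ("500 escudos", "VALUE")]
remapped = remap(sample)
```
]]></preformat>
        <p>In this example the annotation labeled VALUE has no target class and is stripped, while the other two are relabeled.</p>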
        <p>
          We adopted the BIOES annotation scheme, a variant of the BIO [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] scheme, to label
multi-token named entities. Each token is first tagged with a label that indicates whether it is at the
beginning (B), inside (I), or end (E) of a named entity, whether it is a single-token entity (S), or
whether it is outside (O) of any named entity. The transformed corpus was used as training data. We trained
the final model on the entire collection with a split of 80% training data and 20% validation data.
        </p>
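        <p>To make the BIOES scheme concrete, the following sketch converts entity spans over a token sequence into per-token tags. It assumes non-overlapping spans given as token indices. The function name, the span representation, and the example sentence are illustrative choices, not part of the training pipeline.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, label)


def bioes_tags(n_tokens: int, spans: List[Span]) -> List[str]:
    """Tag n_tokens tokens with BIOES labels for non-overlapping spans."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # Single-token entity.
            tags[start] = f"S-{label}"
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{label}"
            tags[end - 1] = f"E-{label}"
    return tags


tokens = ["Lisboa", ",", "27", "de", "Junho", "de", "1961"]
tags = bioes_tags(len(tokens), [(0, 1, "E53"), (2, 7, "E52")])
```
]]></preformat>
        <p>Compared with plain BIO, the explicit E and S tags mark where an entity ends, so entity boundaries are visible in the tags themselves.</p>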
        <p>Returning to the OCR output of the typewritten letter in Figure 2, the passage covers the
heading and closing of a letter, which contain the recipient’s name and address, the date, and
the sender’s name. Listing 1 contains the list of labeled concepts
identified in the letter.</p>
        <preformat>Edmundo &lt;B-E21&gt; Oliveira &lt;I-E21&gt; Orffo &lt;E-E21&gt;
Avenida &lt;B-E53&gt; D. &lt;I-E53&gt; Dinis &lt;E-E53&gt;
Lisboa &lt;S-E53&gt;
27 &lt;B-E52&gt; de &lt;I-E52&gt; Junho &lt;I-E52&gt; de &lt;I-E52&gt; 1961 &lt;E-E52&gt;
SECRETÁRIO &lt;S-ARE8&gt;</preformat>
        <p>Listing 1: Annotated concepts extracted from the typewritten letter presented in Figure 2.</p>
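        <p>Before the annotations in Listing 1 can be mapped to the ontology, the per-token tags have to be grouped back into whole entities. A minimal sketch of that grouping is shown below. It assumes well-formed BIOES sequences and represents the listing as (token, tag) pairs; the function name and data layout are our own illustrative choices.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple


def decode_bioes(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group (token, tag) pairs into (entity text, label) pairs.

    Assumes well-formed sequences: every B is closed by an E with the
    same label, and I/E tags never appear without a preceding B.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    for token, tag in tagged:
        if tag == "O":
            current = []
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":
            entities.append((token, label))
            current = []
        elif prefix == "B":
            current = [token]
        else:  # "I" or "E": extend the open entity
            current.append(token)
            if prefix == "E":
                entities.append((" ".join(current), label))
                current = []
    return entities


listing_1 = [
    ("Edmundo", "B-E21"), ("Oliveira", "I-E21"), ("Orffo", "E-E21"),
    ("Avenida", "B-E53"), ("D.", "I-E53"), ("Dinis", "E-E53"),
    ("Lisboa", "S-E53"),
    ("27", "B-E52"), ("de", "I-E52"), ("Junho", "I-E52"),
    ("de", "I-E52"), ("1961", "E-E52"),
    ("SECRETÁRIO", "S-ARE8"),
]
entities = decode_bioes(listing_1)
```
]]></preformat>
        <p>For the tags in Listing 1 this yields five entities: one person, two places, one time-span, and one role.</p>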
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Ontology Population</title>
        <p>
          After annotating the text extracted from the digital representations, we must map each
annotation to ArchOnto. The mapping and instantiation of concepts and relations were implemented
with Owlready2, a Python package for ontology-oriented programming [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Work on the ontology rules for instantiating concepts and linking nominative relations
is still ongoing. Listing 2 presents the result of instantiating the concepts extracted in the
IE module.
        </p>
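        <p>The idea of instantiating one individual per extracted concept can be illustrated without any ontology library. The sketch below deliberately does not use the package named above: it emits plain (subject, predicate, object) tuples using only the standard library. The "ex" prefix, the naming pattern for individuals, and the use of a label property are assumptions made for illustration and do not reproduce ArchOnto's actual structure or relations.</p>
        <preformat><![CDATA[
```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"


def instantiate(entities: List[Tuple[str, str]], prefix: str = "ex") -> List[Triple]:
    """Create one typed, labeled individual per (text, class id) entity."""
    triples: List[Triple] = []
    for index, (text, class_id) in enumerate(entities, start=1):
        # Hypothetical naming pattern: class identifier plus running index.
        individual = f"{prefix}:{class_id}_{index}"
        triples.append((individual, RDF_TYPE, f"{prefix}:{class_id}"))
        triples.append((individual, RDFS_LABEL, text))
    return triples


triples = instantiate([("Lisboa", "E53"), ("SECRETÁRIO", "ARE8")])
```
]]></preformat>
        <p>Relations between individuals, such as linking a person to a role, would require further rules and are not covered by this sketch.</p>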
        <p>This paper presents an automatic approach to populate the ArchOnto ontology. However,
this approach can be generalized to other Linked Data models, such as RiC-O (Records in
Contexts Ontology), by adapting the mapping and instantiation of the ontology’s concepts
and relations.</p>
        <p>Listing 2: Instantiation of concepts and relations from the annotation presented in Listing 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>
        For the evaluation of each of this work’s modules, we first created a dataset that contains
typewritten Portuguese documents from the 20th century [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For each task, we used
subsets of this dataset to suit the different experiments’ goals and requirements. For the OCR
module, we transcribed 708 typewritten digital representations extracted from the original
dataset. This new dataset [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] will be split in two, one for parameter optimization and another
for evaluation. The model trained in the IE module will be evaluated using the First HAREM
collection [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The corpus will be transformed similarly to the approach detailed with the
Second HAREM corpus and used as testing data. For the evaluation of the OP module, we will
ask two archivists to provide ArchOnto representations for a subset of thirteen aggregated digital
representations of archival records that were manually transcribed. From these descriptions, a
consensual representation will later be defined and used for a comparison with the output of
our automatic approach.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This paper presented our approach for the automatic population of a domain-specific ontology
with information extracted from non-machine-readable digital representations of cultural
heritage documents. We will extract further information from the documents in the future to
create rich relations between the concepts identified in the NER module. Additionally, we will
evaluate each module to determine the quality of the results. We will also develop two APIs:
one for the OCR module that extracts the content of digital representations and another for
the OP module that populates the ArchOnto ontology given a textual file. Furthermore, we
will integrate the pipeline we developed into the EPISA interfaces by suggesting concepts and
relations when an archivist uploads a digital representation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work is financed by National Funds through FCT - Foundation for Science and Technology
I.P., within the scope of the EPISA project - DSAIPA/DS/0023/2018.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Koch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <article-title>Moving from ISAD(G) to a CIDOC-CRM-based Linked Data Model in the Portuguese Archives</article-title>
          ,
          <source>Journal on Computing and Cultural Heritage</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Witte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Krestel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kappler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Lockemann</surname>
          </string-name>
          ,
          <article-title>Converting a Historical Architecture Encyclopedia into a Semantic Knowledge Base</article-title>
          ,
          <source>IEEE Intelligent Systems</source>
          <volume>25</volume>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Packer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Embley</surname>
          </string-name>
          ,
          <article-title>Cost effective ontology population with data from lists in OCRed historical documents</article-title>
          ,
          <year>2013</year>
          , pp.
          <fpage>44</fpage>
          -
          <lpage>52</lpage>
          .
          doi:
          <pub-id pub-id-type="doi">10.1145/2501115.2501132</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tudhope</surname>
          </string-name>
          ,
          <article-title>A knowledge-based approach to information extraction for semantic interoperability in the archaeology domain</article-title>
          ,
          <source>Journal of the Association for Information Science and Technology</source>
          <volume>67</volume>
          (
          <year>2015</year>
          ).
          doi:
          <pub-id pub-id-type="doi">10.1002/asi.23485</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Goy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Damiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Loreto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Musso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Radicioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Accornero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rovera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Astrologo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boniolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>D'Ambrosio</surname>
          </string-name>
          ,
          <article-title>PRiSMHA (Providing Rich Semantic Metadata for Historical Archives)</article-title>
          ,
          <source>in: JOWO</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leontino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Magro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Picardi</surname>
          </string-name>
          ,
          <article-title>Bringing semantics into historical archives with computer-aided rich metadata generation</article-title>
          ,
          <source>J. Comput. Cult. Herit</source>
          . (
          <year>2021</year>
          ). URL: https://doi.org/10.1145/3484398.
          doi:
          <pub-id pub-id-type="doi">10.1145/3484398</pub-id>
          , Just Accepted.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bradski</surname>
          </string-name>
          ,
          <article-title>The OpenCV Library</article-title>
          ,
          <source>Dr. Dobb's Journal of Software Tools</source>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <article-title>GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository)</article-title>
          , https://github.com/tesseract-ocr/tesseract,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Consoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Terra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Collonini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vieira</surname>
          </string-name>
          ,
          <article-title>Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition</article-title>
          ,
          <source>in: Proceedings of the 8th Brazilian Conference on Intelligent Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Freitas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. G.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Carvalho</surname>
          </string-name>
          ,
          <article-title>Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese</article-title>
          ,
          <source>in: Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Valletta, Malta,
          <year>2010</year>
          . URL: http://www.lrec-conf.org/proceedings/lrec2010/pdf/412_Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ramshaw</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          ,
          <article-title>Text Chunking using Transformation-Based Learning</article-title>
          ,
          <source>CoRR cmp-lg/9505040</source>
          (
          <year>1995</year>
          ). URL: http://arxiv.org/abs/cmp-lg/9505040.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Lamy</surname>
          </string-name>
          ,
          <article-title>Owlready: Ontology-oriented programming in Python with automatic classification and high level constructs for biomedical ontologies</article-title>
          ,
          <source>Artificial Intelligence in Medicine</source>
          <volume>80</volume>
          (
          <year>2017</year>
          ).
          doi:
          <pub-id pub-id-type="doi">10.1016/j.artmed.2017.07.002</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Dias</surname>
          </string-name>
          ,
          <article-title>Typewritten Digital Representations of Portuguese Cultural Heritage Documents from the 20th century, Data set</article-title>
          ,
          <year>2022</year>
          .
          doi:
          <pub-id pub-id-type="doi">10.25747/ZC25-1531</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Falcão</surname>
          </string-name>
          ,
          <article-title>Manual Transcriptions of Typewritten Digital Representations of Portuguese Cultural Heritage Documents from the 20th Century, Data set</article-title>
          ,
          <year>2022</year>
          .
          doi:
          <pub-id pub-id-type="doi">10.25747/WPNA-JE39</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Seco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cardoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vilela</surname>
          </string-name>
          ,
          <article-title>HAREM: An Advanced NER Evaluation Contest for Portuguese</article-title>
          ,
          <source>in: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06)</source>
          ,
          <source>European Language Resources Association (ELRA)</source>
          , Genoa, Italy,
          <year>2006</year>
          . URL: http://www.lrec-conf.org/proceedings/lrec2006/pdf/59_pdf.pdf.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>