<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information Extraction from Historical Texts: a Case Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paulo Quaresma</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Jose Bocorny Finatto</string-name>
          <email>mariafinatto@gmail.com</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidade Federal do Rio Grande do Sul</institution>
          ,
          <addr-line>RS</addr-line>
          ,
          <country country="BR">Brasil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade de Evora</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper a set of information extraction experiments over historical texts is described. The experiments were done over the Spanish book Observaciones de Curvo, written by Suarez de Ribera in 1735 and based on Curvo Semedo's 1707 work Observaciones medicas doutrinaes de cem casos gravissimos, to evaluate which information can be extracted in a fully automatized way. Using publicly available NLP tools we extracted named entities (persons and places) and identified events. This information was used to populate a specialized ontology, allowing the application of powerful visualization and inference processes. A preliminary evaluation of the quality of the extracted information showed that, in spite of the use of generic NLP tools, this process is able to automatically identify relevant information and to help human experts in the creation of historical knowledge bases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Text information extraction is an increasingly relevant NLP task, aiming to
automatically structure unstructured text. At the same time, historical documents
hold a huge amount of potential information, which is not easily accessible to
researchers or citizens.</p>
      <p>In this context, a project aiming to automatically populate a specialized
ontology with information extracted from historical texts was created by
researchers of the University of Evora, Portugal, and UFRGS (Universidade
Federal do Rio Grande do Sul), RS, Brasil.</p>
      <p>In this paper we describe the initial experiments done over the Spanish book
Observaciones de Curvo written by Francisco Suarez de Ribera in 1735 based on
the 1707 Curvo Semedo's work Observaciones medicas doutrinaes de cem casos
gravissimos.</p>
      <p>It is important to note that all the processing was done by applying NLP
computational tools without any human intervention, from the OCR step to the
ontology population. Our main goal with this option was to analyse and evaluate
how well a pipeline of computational processes is able to deal with historical texts
in a fully automatized way.</p>
      <p>In the next section we briefly describe the corpus used; section 3 presents
the applied methodology; section 4 discusses the NLP architecture in more
detail; section 5 describes the ontology used; and section 6 presents and
discusses a preliminary evaluation.</p>
    </sec>
    <sec id="sec-2">
      <title>Corpus</title>
      <p>As noted above, as the base corpus for this research work we selected the
book Observaciones de Curvo, written by Francisco Suarez de Ribera in 1735 and
based on Curvo Semedo's work Observaciones medicas doutrinaes de cem
casos gravissimos. The book is available for download (in pdf and txt formats)
from the Spanish National Library (http://bdh-rd.bne.es/viewer.vm?id=0000081871&amp;page=1)
and comprises 549,267 tokens. It is important to note that the text version is
the unrevised output of an OCR process. An example of the existing problems
can be seen in the initial sentences:</p>
      <p>OBSERVACION PRIMERA.</p>
      <p>DE UNA COLICA NEPHRITICA, que a igio al Excekntifsimo fen~or Principe
de Ligne, y Marques de Arronches.</p>
      <p>N feis de E n e r o del an~o de 1 68&lt;$. a igio al dicho Excelentifsimo fen~or la
colica nephritU ca : t e n g o a Tentado, que la verdadera cien-: cia no con ne
en lo r u i d o f o , o campanudo de las p a l a b r a s , ni en la apariencia, o
pompa exterior de los v e n i d o s , mas si en las obras ordenadas con acier"&lt; t
o , y efectuadas con felicidad:</p>
      <p>As can be seen from this example, there are many problems with the OCR
quality of the document. We had two main options: a) manually or semi-manually
revise the texts; b) use the text as is, without any revision. We selected
option b) because one of our goals was to evaluate which information can be
extracted from historical documents in a fully automatized way.</p>
      <p>
        A different approach was followed in a related project over the original work
of Curvo Semedo [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which took as its basis a fully revised version of the
digitized images.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>As working methodology we followed a classical NLP pipeline of processes:
1. Lexical analysis
2. Syntactical analysis
3. Semantical analysis
4. Ontology population</p>
      <p>The lexical and syntactical analyses identify lemmas and perform part-of-speech
tagging and dependency parsing. With the semantic analysis we are able to perform
named entity recognition (persons, organizations, places, time) and semantic
role labelling over the results of the previous modules. The last module, ontology
population, receives as input the output of the previous module and creates
instances of an ontology, allowing the formal representation of the extracted
information.</p>
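      <p>As an illustration of the data flow through these four stages, the following Python sketch chains stand-in functions (all names and stage bodies are hypothetical; in the project itself the analyses are delegated to Freeling):</p>

```python
# Hypothetical sketch of the four-stage pipeline's data flow:
# raw text -> tokens -> tagged tokens -> events -> ontology triples.
def lexical_analysis(text):
    return text.split()                     # stand-in tokenizer

def syntactic_analysis(tokens):
    verbs = {"murieron"}                    # toy verb lexicon, illustration only
    return [(t, "VERB" if t in verbs else "X") for t in tokens]

def semantic_analysis(tagged):
    # stand-in event spotter: one event per verb token
    return [{"predicate": form} for form, pos in tagged if pos == "VERB"]

def populate_ontology(events):
    # stand-in population step: one typing triple per detected event
    return [(f"event{i}", "rdf:type", "sem:Event") for i, _ in enumerate(events)]

text = "y como por caufa de ella murieron en una cafa cinco hijos"
triples = populate_ontology(semantic_analysis(syntactic_analysis(lexical_analysis(text))))
print(triples)
```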
      <p>
        As NLP tools we used Freeling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] from Lluis Padro, which supports
several languages, including Spanish and Portuguese.
      </p>
      <p>
        In the scope of this work we focused on the extraction of entities and
events, and we used a specialized ontology developed in OWL in the context
of another research project [
        <xref ref-type="bibr" rid="ref6">6, 7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-4">
      <title>NLP tools</title>
      <p>The architecture is based on a pipeline of NLP modules and is represented in
figure 1.</p>
      <p>Each sentence is processed by a series of modules: part-of-speech tagging,
named entity recognition, dependency parsing, semantic role labelling,
subject-verb-object identification, and the creation of ontology instances in OWL.</p>
      <p>The main goal is to identify events in the text, which are used to populate a
predefined ontology.</p>
      <p>
        As already mentioned, we used the Freeling framework [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for this pipeline of
event extraction.
      </p>
      <p>Below is an example of the obtained output for a simple sentence of the
corpus:
... y como por caufa de ella murieron en una cafa cinco hijos ...</p>
      <p>The NER output is:
y y CC 0.999989
como como CS 0.967153
por por SP 1
caufa caufa NCFS000 0.919869
de de SP 0.999961
ella el PP3FS00 1
murieron morir VMIS3P0 1
en en SP 1
una uno DI0FS0 0.951973
cafa cafa NCFS000 1
cinco 5 Z 0.999454
hijos hijo NCMP000 1</p>
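      <p>Such whitespace-separated "form lemma tag confidence" lines are easy to load into a structured form for later stages; a minimal parsing sketch (not Freeling's own API):</p>

```python
# Parse "form lemma tag confidence" lines, as in the output shown above.
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    lemma: str
    tag: str
    confidence: float

def parse_analysis_lines(text):
    tokens = []
    for line in text.strip().splitlines():
        form, lemma, tag, conf = line.split()
        tokens.append(Token(form, lemma, tag, float(conf)))
    return tokens

sample = """murieron morir VMIS3P0 1
cafa cafa NCFS000 1
cinco 5 Z 0.999454
hijos hijo NCMP000 1"""

tokens = parse_analysis_lines(sample)
verbs = [t for t in tokens if t.tag.startswith("V")]
print(verbs[0].lemma)  # morir
```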
      <p>And the final Freeling output is:
Pred100.4: die.01|expire.01|perish.01 t100.20 (murieron)
AM-LOC t100.21 (en una cafa) [t100.21 .. t100.23]
A1 t100.25 (cinco hijos) [t100.24 .. t100.25]</p>
      <p>
        According to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an ontology is a formal specification of a conceptualization; it
allows the representation of entities and events, along with their properties and
relations, according to a system of categories.
      </p>
      <p>In the context of this work we used the Simple Event Model (SEM)
as a baseline model. A graphical representation of this ontology is given in Fig. 2
and its detailed design is presented in [8].</p>
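      <p>A role-labelled frame such as the one above can be mapped onto SEM-style triples. The mapping below is a hedged sketch (the role-to-property choices and the ex: identifiers are assumptions for illustration, not the project's actual schema):</p>

```python
# Assumed mapping from an SRL frame to SEM-style triples (illustrative only):
# participants (A0/A1) -> sem:hasActor, AM-LOC -> sem:hasPlace.
def sem_triples(event_id, verb, roles):
    ev = f"ex:{event_id}"                      # hypothetical instance URI
    triples = [(ev, "rdf:type", "sem:Event"),
               (ev, "rdfs:label", verb)]
    for role, value in roles.items():
        if role in ("A0", "A1"):
            triples.append((ev, "sem:hasActor", value))
        elif role == "AM-LOC":
            triples.append((ev, "sem:hasPlace", value))
    return triples

event = sem_triples("e100", "murieron",
                    {"A1": "cinco hijos", "AM-LOC": "en una cafa"})
print(len(event))  # 4 triples for this frame
```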
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>After applying the pipeline of NLP processes we were able to obtain the following
output:
- NER (named entities):
  410 places (282 distinct)
  2011 persons (1294 distinct)
- Events:
  14005 events
  2896 with subject (A0)
  8271 with direct objects (A1)
  1685 with indirect objects (A2)
  901 with place information (AM-LOC)
  2747 with adverbials (AM-ADV)</p>
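      <p>From these counts one can derive, for instance, the share of the 14005 events carrying each role:</p>

```python
# Role coverage implied by the counts above (percentages of the 14005 events).
counts = {"A0": 2896, "A1": 8271, "A2": 1685, "AM-LOC": 901, "AM-ADV": 2747}
total = 14005
coverage = {role: round(100 * n / total, 1) for role, n in counts.items()}
print(coverage)  # e.g. roughly 59% of events carry a direct object (A1)
```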
      <p>We performed a preliminary evaluation of the quality of the extraction
processes, calculating the precision of the extraction, i.e. the
percentage of extracted concepts that are correct.</p>
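      <p>Precision in this sense is straightforward to compute once a human judgement of correctness is available; a small sketch with hypothetical data:</p>

```python
# Precision = fraction of extracted items judged correct (hypothetical gold set).
def precision(extracted, correct):
    return sum(1 for item in extracted if item in correct) / len(extracted)

extracted_places = ["Sevilla", "Lisboa", "fen~or", "cafa", "Salamanca"]
gold = {"Sevilla", "Lisboa", "Salamanca"}
print(precision(extracted_places, gold))  # 0.6
```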
      <p>For each kind of information we present below this value and some examples:</p>
      <p>- Places (precision: 21%):
Sevilla
Lisboa
Salamanca
Rua de la Paz</p>
      <p>- Persons (precision: 22%):
Curvo
Cardenal
Rey
Hypocrates
Galeno</p>
      <p>- Events (precision: 5%, calculated over a sample of 10% of the
extracted events):
Verbo: murieron; A1: cinco hijos; AM-LOC: en una cafa
Verbo: caera; A1: la enferma; AM-ADV: en una hydrope a
Verbo: dar; A1: la Uncion; A2: a el paciente
Verbo: tomava; A1: una taza com caldo, o agua; AM-LOC: en las
manos; AM-ADV: hirviendo</p>
      <p>As can be seen from these examples, it was possible to extract relevant
information and, in some cases, information of considerable complexity
(see, for instance, the last of the event examples). Nevertheless, the precision of
the extraction is still quite low.</p>
      <p>We analysed the main error situations, and the main sources
of errors can be characterized in the following way:
- Places: Most of the errors are related to incorrect OCR and to
misclassified entities (e.g. persons classified as places).
- Persons: The main source of errors for this class of entities is clearly the poor
quality of the initial OCR, which created many incorrect words that, being
unknown to the NLP tools, tend to be classified as proper nouns.
- Events: As expected, the precision for this kind of information is very low.</p>
      <p>This can be explained, again, by the poor quality of the initial OCR output
and the lack of lexical and syntactical modules adapted to the
Spanish of the 18th century.</p>
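      <p>One possible mitigation suggested by this analysis, sketched below with hypothetical data: check candidate proper nouns against a lexicon (ideally one covering the period's spelling) and flag unknown forms as likely OCR artefacts rather than names:</p>

```python
# Hypothetical filter: candidate proper nouns absent from the lexicon are
# flagged as suspect (likely OCR-damaged tokens, not real names).
def suspect_proper_nouns(candidates, lexicon):
    return [w for w in candidates if w.lower() not in lexicon]

lexicon = {"sevilla", "lisboa", "curvo"}          # toy lexicon for illustration
candidates = ["Sevilla", "Excekntifsimo", "Curvo", "fen~or"]
print(suspect_proper_nouns(candidates, lexicon))  # flags the two OCR artefacts
```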
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>As a main conclusion, we believe we were able to show that it is possible to use
standard NLP computational tools to automatically extract information from
historical texts.</p>
      <p>It is important to emphasize that we presented an initial evaluation with an
unrevised corpus generated directly by an OCR system. Thus, the obtained
results are far from perfect, and they show the relevance of having good-quality
texts as input to the processing pipeline. Nevertheless, the proposed approach
can be used as a baseline and a basis for additional research work.</p>
      <p>As future work, a deeper evaluation of the results should be done, and NLP
tools adapted to the lexicon and syntax of historical corpora need to be
developed.</p>
      <p>We also foresee the creation of a web-based application for the
visualization of and access to the created ontology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Graphdb, http://graphdb.ontotext.com/, [Available online: accessed on 24/02/2020]</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Protege, https://protege.stanford.edu/, [Available online: accessed on 24/02/2020]</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Finatto</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goncalves</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Portuguese corpora of the 18th century: old medicine texts for teaching and research</article-title>
          . In: Fiser,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Pancur</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of the Conference on Language Technologies and Digital Humanities</source>
          , Ljubljana, Slovenia,
          <source>September</source>
          <volume>20</volume>
          -
          <fpage>21</fpage>
          (
          <year>2018</year>
          ), http://dspace.uevora.pt/rdpc/handle/10174/23606
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Guarino</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giaretta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Ontologies and knowledge bases: Towards a terminological clarification</article-title>
          . In:
          <article-title>Towards very Large Knowledge bases: Knowledge Building and Knowledge sharing</article-title>
          . pp.
          <volume>25</volume>
          -
          <fpage>32</fpage>
          . IOS Press (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Padro</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanilovsky</surname>
          </string-name>
          , E.:
          <article-title>Freeling 3.0: Towards wider multilinguality</article-title>
          .
          <source>In: Proceedings of the Language Resources and Evaluation Conference (LREC</source>
          <year>2012</year>
          ). ELRA, Istanbul, Turkey (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Quaresma</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>V.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raiyani</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bayot</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goncalves</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>From textual information sources to linked data in the agatha project</article-title>
          .
          <source>In: INAP { Proceedings of the 22nd International Conference on Applications of Declarative Programming and Knowledge Management</source>
          , Cottbus, Germany, September 9-
          <issue>13</issue>
          ,
          <year>2019</year>
          . pp.
          <volume>1</volume>
          -
          <issue>11</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Quaresma, P., Nogueira, V.B., Raiyani, K., Bayot, R.: Event extraction and representation: A case study for the portuguese language. Information 10(6) (2019), https://www.mdpi.com/2078-2489/10/6/205</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Van Hage, W.R., Malaise, V., Segers, R., Hollink, L., Schreiber, G.: Design and use of the simple event model (sem). Web Semantics: Science, Services and Agents on the World Wide Web 9(2), 128-136 (2011)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>