<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Event identification in the Monsoon Books (1616-1618)</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>CIDEHUS, Universidade de E</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidade Federal da Bahia</institution>
          ,
          <country country="BR">Brasil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Estado da Índia was the most complex set of Portuguese overseas territories. This paper investigates a digital methodology that employs Natural Language Processing to study historical events concerning it during the period 1616-1618. We explore the application of an event extraction tool to an extract of the Monsoon Books. Our preliminary results expose the current problems and help us shape further work on the automatic processing of historical corpora.</p>
      </abstract>
      <kwd-group>
        <kwd>Event identification</kwd>
        <kwd>17th century Portuguese</kwd>
        <kwd>Portuguese India</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A process like this, applied to the study of a colonial macro-region such as the eastern
Portuguese Empire, identifies and categorizes human actions and episodes that
determine not only patterns of historical junctures in time and place, but also
disruptive events that underline processes of change.</p>
    </sec>
    <sec id="sec-2">
      <title>The Monsoon Books</title>
      <p>
        The Documentos Remetidos da Índia, or Livros das Monções (Monsoon Books),
collect the letters exchanged between the monarchs and Portuguese government
councils and the viceroys of India, in which all types of affairs concerning the so-called
Portuguese Estado da Índia were discussed. Their geographical scope extends
from Eastern Africa to Japan. This collection is essential for
understanding the internal dynamics of the Portuguese Estado da Índia until the 19th
century. In fact, the letters are considered the core documents produced by Portuguese
authorities in Asia. Because this documentary corpus covers
all types of issues, the Monsoon Books are unique and a privileged laboratory for
building a new analytical model and approach to understanding the internal dynamics of
the macro-regions of colonial empires. The Monsoon Books are composed of sets of
documents held both in the Portuguese National Archives, in Lisbon, and
in the Historical Archives of Goa, in Panjim, India. Since this paper intends to
test an automatic event extraction model in order to conceive an interpretative
framework for the European colonial presence in overseas macro-regions, we employ
some of the already transcribed and published books referring to the years
1616-1618 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Event Identification and Classification</title>
      <p>
        The goal of event identification and classification is to detect mentions
of target event types in plain text. Given an input text, an event detection
(ED) system should identify whether its sentences contain events of
interest by locating event trigger terms (event identification)
and assign each trigger to a specific event type (event classification). Consider
the following sentence:
“Meridian National Corp. said it sold 750,000 shares of its common
stock to the McAlpine family interests, for $1 million, or $1.35 a share.”
According to the TimeBankPT corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the words “said” and “sold”
describe event occurrences (triggers) for two distinct event mentions, one of type
Statement and the other of type Commerce Selling, respectively, if we consider
the FrameNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] lexicon as a source of target event types. Most of the work
done on ED in the literature has focused on contemporary variants of
European languages [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and, even for the contemporary variants, few studies have
addressed event identification and classification for the Portuguese language.
Notably, the work of Sacramento and Souza [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] describes a method and, to our knowledge, the only
publicly available system for event extraction on Portuguese
sentences, TEFE. TEFE encodes ED as a sequence labelling problem and employs
bidirectional recurrent neural networks to predict event triggers and
their types simultaneously. It was trained on a TimeBankPT corpus enriched with event types
from the FrameNet project, using deep neural networks and contextualized word
embeddings from a Portuguese BERT model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
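      <p>To make the sequence-labelling formulation concrete, the sketch below decodes per-token labels into trigger mentions for the example sentence above. It assumes a BIO tagging scheme, which is a common choice but is not stated here to be the one TEFE uses; the function name <monospace>decode_triggers</monospace> and the label strings are hypothetical and are not part of TEFE's interface.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import List, Tuple


def decode_triggers(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Turn per-token BIO labels into (trigger text, event type) pairs.

    Assumed (hypothetical) label format: "O", "B-TYPE" or "I-TYPE".
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    mentions = []
    current_words, current_type = [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new trigger starts; store the previous one, if any.
            if current_words:
                mentions.append((" ".join(current_words), current_type))
            current_words, current_type = [token], label[2:]
        elif label.startswith("I-") and current_words and label[2:] == current_type:
            # The current trigger continues over this token.
            current_words.append(token)
        else:
            # "O" or an inconsistent "I-" label closes the current trigger.
            if current_words:
                mentions.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        mentions.append((" ".join(current_words), current_type))
    return mentions


# Illustrative labels for part of the example sentence (not system output).
tokens = ["Meridian", "National", "Corp.", "said", "it", "sold", "750,000", "shares"]
labels = ["O", "O", "O", "B-Statement", "O", "B-Commerce_Selling", "O", "O"]
print(decode_triggers(tokens, labels))
```
]]></preformat>
      <p>Predicting one label per token lets a single pass over the sentence yield both the position of each trigger and its event type, which is what makes the joint formulation convenient.</p>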
      <p>In this work, we apply TEFE to historical data, namely the previously
transcribed texts from the Monsoon Books, and evaluate its usefulness for
identifying historical junctures in historical corpora.</p>
    </sec>
    <sec id="sec-4">
      <title>Applying Event Identification to the Monsoon Letters</title>
      <p>Each volume of the Monsoon Books comprises more than 300 printed pages
of narrative text. Applying a computational event extraction tool
therefore supports historical research by rapidly extracting textual information from a large
collection of data. One way of studying this source is by organizing the events
it describes. Language technology may help the reader with hints, extraction
and quantification of these events. The system described in the previous section was
developed to find and classify mentions of events, as
well as to identify the participants in those events. The system receives an
input sentence such as “Eu el-rey faço saber aos que este alvará virem que tenho
entendido que pelo mau comcerto que tiveram as naus que, o anno passado de
seiscentos e quinze, partiram do porto de Goa para este reino[...]” and identifies
that “partiram” (departed) is an instance of a Departure event (described by
the Departing frame) and that “as naus” (the ships), “porto de Goa” (Goa’s
harbor) and “este reino” (this kingdom) are entities participating in that event,
as depicted below:
</p>
      <p>[as naus que [...]]<sub>Theme</sub> [partiram]<sub>trigger: Departing</sub> [do porto de Goa]<sub>Source</sub> [para este reino]<sub>Goal</sub></p>
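      <p>The annotated sentence above can be held in a small data structure. The following sketch shows one possible representation of an event mention with its trigger, frame and role-labelled participants. The class names <monospace>EventMention</monospace> and <monospace>Argument</monospace> and their fields are assumptions made for illustration; they do not describe TEFE's actual output format.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    role: str  # frame element name, e.g. "Theme"
    text: str  # span of the sentence that fills the role


@dataclass
class EventMention:
    trigger: str  # word that signals the event
    frame: str    # frame (event type) evoked by the trigger
    arguments: List[Argument] = field(default_factory=list)

    def roles(self) -> Dict[str, str]:
        """Map each role name to the text span that fills it."""
        return {arg.role: arg.text for arg in self.arguments}


# The Departing event described in the text, encoded by hand.
departure = EventMention(
    trigger="partiram",
    frame="Departing",
    arguments=[
        Argument("Theme", "as naus"),
        Argument("Source", "porto de Goa"),
        Argument("Goal", "este reino"),
    ],
)
print(departure.roles()["Source"])
```
]]></preformat>
      <p>Keeping the roles separate from the trigger makes it straightforward to query, for instance, every event whose Source is a given harbor.</p>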
      <p>In this work we present an initial assessment of the application of this system
to the source under study, in order to understand what is needed to adapt the tool to
language variants (from a different time span, in this case). Next we present
some passages of the studied source, isolated in five examples, and the output the
system produced for them, along with a discussion of the challenges
ahead.</p>
      <p>Example 1: Eu el-rey faço saber aos que este alvará virem que tenho
entendido que pelo mau comcerto que tiveram as naus que, o anno
passado de seiscentos e quinze, partiram do porto de Goa para este reino,
e por virem sobrecarregadas, em saindo d'aquella barra começaram logo
a fazer agoa, o que foi causa de se perderem as naus Capitania e Sam
Boaventura, e a Sam Philippe vir mui arriscada, e por esse respeito
receber minha fazenda e meus vassalos notável perda;</p>
      <p>The passage reports a communication about the poor maintenance of
ships and the consequent loss of value caused by overloaded ships sailing from
Goa to Portugal, some of which sank. In this passage, as presented in Table 1, seven events
were identified. They relate to the communication about the poor maintenance
of the ships, the departure of these ships from Goa, the onset of the problems caused
by that poor maintenance, and the resulting losses due to the problems
encountered with three of the ships.</p>
      <p>Example 2: e porque convem muito a meu serviço saber-se com a
consideração necessaria de como se procedeo no concerto e carga das ditas
naus, e se ouve culpa de alguem de partirem tarde, hei por bem e mando
ao Desembargador Gonçalo Pinto da Fonseca tire devassa na
conformidade d'este alvara, sabendo mui particularmente a causa que houve pera
as ditas naus virem sobrecarregadas e tam mal concertadas, e partirem
tarde, e como se procedeo no concerto e carga d'ellas como se refere;</p>
      <p>Continuing from the previous passage, this one demands an account of the
causes of the poor maintenance of the ships and of how it was carried out, and also
asks who was responsible for the delay of their departure and
why they were overloaded. The results of processing this passage with TEFE
are shown in Table 1. Two of the identified events relate to the request
for information. The third (Removing) was in fact used in the sense of asking rather than
removing, an error that may be explained by verb ambiguity.</p>
      <p>Example 3: e depois de tirada a dita devassa a mande serrada por vias
nas primeiras naus que pera este reino partirem d'aquellas partes, dirigida
ao Conselho de minha fazenda, pera se ver n'elle e prover no caso como
mais convier a meu serviço;</p>
      <p>At this point, instructions are given on how the answer to the enquiry should
be sent. The event “ver” (to see), shown in Table 1, is identified, but
at its present stage of development the tool is not flexible enough and
misreads some meanings of the text, a common difficulty with a Portuguese text
that is 400 years old.</p>
      <p>Example 4: o que cumprirá sem duvida alguma, por este que valerá
como carta e não passará pela chancellaria, o qual vai por tres vias.
Francisco de abreu o fez em Lisboa a xbij (desasete) de fevereiro de
seiscentos e desaseis. Diogo Soares o fez escrever.</p>
      <p>In this particular example, the tool allows us to identify automatically
the material authorship of the letter, pointing to the agent responsible
for the act of writing the document in a given place and time, as can be
seen in Table 1. A word-level statistical analysis is not yet meaningful,
since the documentary corpus has not yet been fully processed digitally.
From the results shown in Table 1 we can immediately perceive how the diverse
documents in the analysed volume of the Monsoon Books were used to report
to the metropolis or to India on the remote events relevant to the Portuguese
administration of the Estado da Índia and the Cape Route at the beginning of the
seventeenth century. The most common event type is Awareness, underlining the
role of communication that these documents performed.</p>
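      <p>Finding the most common event type amounts to a frequency count over the extracted labels. The sketch below illustrates that step with Python's standard <monospace>collections.Counter</monospace>. The helper name <monospace>rank_event_types</monospace> is hypothetical, and the list of labels is a toy input invented for illustration; it does not reproduce the contents of Table 1.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_event_types(event_types: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (event type, count) pairs ordered from most to least frequent."""
    return Counter(event_types).most_common()


# Toy labels for illustration only; these are not the paper's results.
toy_labels = [
    "Awareness", "Departing", "Awareness",
    "Statement", "Awareness", "Departing",
]
ranking = rank_event_types(toy_labels)
print(ranking[0])
```
]]></preformat>
      <p>The same count could be kept per year or per place once the letters are dated and located, which is the kind of analysis discussed in the final remarks.</p>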
    </sec>
    <sec id="sec-5">
      <title>Final remarks</title>
      <p>This paper analyses the application of event identification, extraction and
classification as a step in the study of historical sources. This initial study was
needed to identify the current problems and to help us shape further work. As such,
we are now planning to annotate events in a portion of this historical source.
The annotation will serve to adapt the tool to this specific temporal language
variant and to the needs of the historical study, and will also allow a more rigorous
evaluation of the extraction provided by the tool. Regarding the language differences,
we plan to study whether normalization could improve the extraction of events.
The adapted version of the tool will be used for further studies of this
historical source. Such tools may guide the reader in observing the
contents of historical sources. The tool enables more efficient
information extraction from very large series of historical texts. From the examples shown,
and given the nature of this source, we may consider uses such as helping
researchers to find all events related to ships, actions of the Crown, conflicts,
etc. If combined with other NLP tools, it would be possible to create a list of
the names of the ships, and perhaps to find information about the voyages they
made. It can also be used to observe the geographical location of a person
at a certain time and the type of actions they perform as historical
characters. Even a statistical analysis of the types of events associated with times
and places can bring new insights to the historical interpretation of a certain
reality. We can detect patterns of human action from the most frequent event types
in a certain time frame, as well as interpret the exceptional events and evaluate
their historical meaning. If, in the future, we are able to cross-examine the interaction between
the most frequent event types and the types of arguments they are most often linked
to, we may be able to perceive historical trends in the types of events
that most concerned the Portuguese administration in the eastern part of the
Portuguese empire.</p>
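      <p>The normalization mentioned in these remarks could start from something as simple as a lexicon lookup that maps historical spellings to their modern forms before the text is passed to the extraction tool. The sketch below is a minimal illustration under that assumption. The lexicon <monospace>MODERN_FORMS</monospace> and the function <monospace>normalize</monospace> are hypothetical, the few entries were chosen by hand from spellings that occur in the quoted passages, and no claim is made that this approach improves extraction.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict

# Hypothetical hand-made lexicon: historical spelling -> modern spelling.
MODERN_FORMS: Dict[str, str] = {
    "anno": "ano",
    "agoa": "água",
    "pera": "para",
    "procedeo": "procedeu",
    "tam": "tão",
}


def normalize(text: str, lexicon: Dict[str, str] = MODERN_FORMS) -> str:
    """Replace whole words found in the lexicon and leave the rest untouched.

    The lookup ignores case; replacements are inserted in lower case.
    """
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return lexicon.get(word.lower(), word)

    # \w+ matches whole words, including accented letters, in Python 3 strings.
    return re.sub(r"\w+", swap, text)


print(normalize("o anno passado partiram do porto de Goa pera este reino"))
```
]]></preformat>
      <p>Matching whole words only avoids rewriting substrings of longer words, at the cost of missing spelling variants that are not listed in the lexicon.</p>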
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>C.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fillmore</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>The Berkeley FrameNet project</article-title>
          .
          <source>In: 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics</source>
          , Volume
          <volume>1</volume>
          . pp.
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          . Association for Computational Linguistics, Montreal, Quebec, Canada (Aug
          <year>1998</year>
          ). https://doi.org/10.3115/980845.980860, https://aclanthology.org/P98-1013
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Costa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Branco</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>TimeBankPT: A TimeML annotated corpus of Portuguese</article-title>
          .
          <source>In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</source>
          . pp.
          <fpage>3727</fpage>
          -
          <lpage>3734</lpage>
          . European Language Resources Association (ELRA), Istanbul, Turkey (May
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Patto</surname>
            ,
            <given-names>R.A.B.</given-names>
          </string-name>
          :
          <article-title>Documentos Remetidos da India ou Livros das Monções</article-title>
          . Tomo IV. Academia Real das Ciências de Lisboa, Lisboa
          (
          <year>1893</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Pearson</surname>
            ,
            <given-names>M.N.</given-names>
          </string-name>
          :
          <article-title>The Portuguese in India</article-title>
          . Cambridge University Press, Cambridge (
          <year>1987</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Polónia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Indivíduos e redes auto-organizadas na construção do império ultramarino português</article-title>
          . In:
          <string-name>
            <surname>Garrido</surname>
            ,
            <given-names>Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freire Costa</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duarte</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (eds.)
          <source>Economia, Instituições e Império. Estudos em Homenagem a Joaquim Romero de Magalhães</source>
          . pp.
          <fpage>349</fpage>
          -
          <lpage>372</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sacramento</surname>
            ,
            <given-names>A.d.S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Joint event extraction with contextualized word embeddings for the Portuguese language</article-title>
          .
          <source>In: Brazilian Conference on Intelligent Systems</source>
          . pp.
          <fpage>496</fpage>
          -
          <lpage>510</lpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nogueira</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lotufo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>BERTimbau: pretrained BERT models for Brazilian Portuguese</article-title>
          .
          <source>In: 9th Brazilian Conference on Intelligent Systems, BRACIS</source>
          . pp.
          <fpage>403</fpage>
          -
          <lpage>417</lpage>
          . Springer, Rio Grande do Sul, Brazil
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sprugnoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tonelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Novel event detection and classification for historical texts</article-title>
          .
          <source>Computational Linguistics</source>
          <volume>45</volume>
          (
          <issue>2</issue>
          ),
          <fpage>229</fpage>
          -
          <lpage>265</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>