<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An annotation proposal based on TEI Schema for Portuguese Corpora editions: A solution for e-Dictor XML annotation problem</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal Institute of Education, Science and Technology of Bahia</institution>
          ,
          <addr-line>Vitória da Conquista</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State University of Southwest Bahia</institution>
          ,
          <addr-line>Vitória da Conquista</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of São Paulo</institution>
          ,
          <addr-line>São Paulo</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Multilayer annotation approaches are becoming widespread in historical corpora. The Tycho Brahe Corpus of Historical Portuguese pioneered this approach, and several other initiatives have since emerged based on its annotation, which is generated by the e-Dictor software. Research conducted with corpora based on this annotation scheme identified gaps in meeting the requirements of manuscript sources and in conformity with annotation standards. Addressing these gaps includes adopting the TEI (Text Encoding Initiative) Guidelines, which provide a standard format for encoding texts, to achieve greater interoperability while also solving adequacy problems in the current encoding, among others. This work presents an annotation scheme proposal for the philological editions (or editorial interventions) and morphological analyses of Portuguese historical corpora encoded in XML (eXtensible Markup Language). The presented encoding scheme fulfills the reliability requirement for this kind of corpus while achieving greater adequacy, conformity, and, consequently, interoperability.</p>
      </abstract>
      <kwd-group>
        <kwd>TEI</kwd>
        <kwd>Historical Corpora</kwd>
        <kwd>Multilayer Corpora Annotation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Multilayer corpora are an emerging methodology within corpus linguistics,
pushing linguistic research to new frontiers. These corpora bring together multiple
independent analyses of the same linguistic phenomena, favoring the interplay of
these concurrent analyses. Multilayer approaches are also spreading in historical
corpora, making it possible to merge the manuscript structure around the text with
representations of morphological analyses and syntactic treebanks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the context of the historical
corpora of the Portuguese language, the Tycho Brahe Corpus - TBC [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was the
pioneer in this approach, bringing together the encoding of philological editions with
morphological and syntactic analyses. The TBC is a historical corpus composed of
Portuguese texts from the 14th to the 19th centuries, which were transferred to digital form.
The current annotation of editorial interventions, designed by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], is expressed in
XML. It defines XML tags for each variant point in the text, so that both the original
forms and their edited versions are kept and can be recovered. This encoding is applied by
the e-Dictor software [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], according to the editorial interventions the user makes in its
graphical interface. The scheme generated by e-Dictor must fulfill the requirement of
conformity of the original text to the historical sources, postulated by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
corroborated by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in the context of historical manuscript sources.
      </p>
      <p>
        Other Portuguese corpora project initiatives emerged in Brazil based on e-Dictor
annotation, and nowadays at least seven large projects are using it [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Among these
projects is the DOViC Corpus (‘Corpus de Documentos Oitocentistas de Vitória da
Conquista’, Nineteenth-Century Documents Corpus of Vitória da Conquista, in the southwest
region of Bahia, Brazil) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. New requirements arose for the e-Dictor XML annotation
scheme as the research progressed. The manuscript documents of DOViC required
encoding several kinds of data specific to this type of document, which the
e-Dictor annotation scheme was not designed to represent. As the Portuguese language is among the
low-resource languages [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], it is important to adopt standard formats or models under
well-documented and accepted norms, such as the TEI Guidelines [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The adoption
of less idiosyncratic models or representations of data contributes to the dissemination
and expansion of research with the Portuguese language, as this favors the exchange
of data and its conversion for use with new annotation tools, increasing the
probability that other researchers can use the corpus and even extend the data. The e-Dictor
scheme also had problems in the adequacy of word representation when segmentation
or join edits occurred.
      </p>
      <p>To improve the sharing, merging, and comparison of Portuguese language
resources and to resolve the problems found, a new annotation scheme for
the historical corpora of the Portuguese language encoded in XML is being
developed. This work is part of ongoing doctoral research within the scope of
the Postgraduate Program in Linguistics (PPGLIN) offered by Universidade Estadual
do Sudoeste da Bahia – UESB (State University of Southwest Bahia). The scheme
under development targets greater adequacy, conformity, and, consequently,
interoperability. It must be TEI-conformant and meet the requirement of reliability to
the original texts. A new version of e-Dictor, as a web-based application, will be
developed, incorporating the proposed annotation. The result can be applied to any
written corpus of the Portuguese language, whether manuscript, oral transcript, or written.</p>
      <p>This paper focuses on the annotation layer that encodes philological editions,
a cross-section of the annotation scheme under development, and presents specifically
an example of a segmentation edit followed by modernization of the word spelling.
The objective of this work is to present and briefly discuss a proposal of a
TEI-conformant scheme for editorial interventions that meets the requirements
of research with historical corpora. The corpus used in this research is
DOViC, and the example is an excerpt from the manuscript document ‘Carta de Liberdade do
Cabrinha Bernardo’ (Manumission Letter of the Slave Bernardo), which is part of this
corpus.</p>
    </sec>
    <sec id="sec-2">
      <title>TEI-conformant proposal for philological editions</title>
      <p>The e-Dictor XML encoding uses the &lt;w&gt; element to represent a word, the &lt;o&gt;
element to represent the original form of the transcribed text, and the &lt;e&gt; element for any
changes made by the editor. The editing type is encoded in the “type” attribute, which
can take the values “seg” for segmentation, “jun” for junction, “mod” for
modernization, and “gra” for spelling edits. Figure 1 presents, on the left side, an excerpt of the
document annotated in the current scheme, where the passage
“dehum” (“of a”) is composed of two words that were originally written together (line
4 of Figure 1). The editor segmented the passage, and this edit is encoded in
line 5 of Figure 1. Then, the word “hum” (“a” or “an” in English), written as “um”
nowadays, was modernized, which is represented in line 6 of Figure 1. The &lt;m&gt;
elements encode the morphological analysis, with the POS category assigned as the value of the “v”
attribute. The element content should match the word analyzed by the POS
tagger.</p>
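      <p>As a rough illustration of the description above, the current e-Dictor encoding of “dehum” might look like the sketch below. It is reconstructed from the prose only and is not a copy of Figure 1: the “id” value, the “P” tag assumed for the preposition “de”, and the exact element order are assumptions.</p>
      <preformat preformat-type="code">&lt;!-- hypothetical id; not taken from the source --&gt;
&lt;w id="w9"&gt;
  &lt;!-- original form as transcribed --&gt;
  &lt;o&gt;dehum&lt;/o&gt;
  &lt;!-- segmentation edit --&gt;
  &lt;e type="seg"&gt;de hum&lt;/e&gt;
  &lt;!-- modernization edit, recorded over the whole string --&gt;
  &lt;e type="mod"&gt;de um&lt;/e&gt;
  &lt;!-- morphological analysis; the "P" value is an assumption --&gt;
  &lt;m v="P"&gt;de&lt;/m&gt;
  &lt;m v="D-UM"&gt;hum&lt;/m&gt;
&lt;/w&gt;</preformat>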
      <p>We consider the e-Dictor annotation in this situation semantically
misleading, as it keeps two words, “de” and “hum”, in a single &lt;w&gt; element. The
“D-UM” tag (equivalent to indefinite determiner) was assigned by the POS tagger
from the string “de hum”, the &lt;e&gt; element content of line 6, but the content that received
this POS tag corresponds to the unmodernized version “hum”, as encoded in line 8.
The annotation also does not make explicit that the modernization applies only to the
word “hum”: the encoding in lines 5 and 6 instead presents the whole string “de um”
as the modernization of “de hum”.</p>
      <p>
        Figure 1 presents, on the right side, the same excerpt encoded with the
new proposal. For the new annotation scheme, we propose that each
“token” coming from the transcription step be encoded in the &lt;orig&gt; element,
recommended by TEI as an indication that the reading follows the original. In the editing
phase, each &lt;orig&gt; element will be wrapped in an &lt;ab&gt; element (defined by TEI as
“anonymous block”). All changes made by the editor to the &lt;orig&gt; content will be
recorded in sequential &lt;reg&gt; (regularization) elements within the parent &lt;ab&gt; block.
The TEI Guidelines state that the &lt;reg&gt; element may be used for any kind of
regularization, including normalization, standardization, and modernization [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The edit
type will be encoded in the “type” attribute, and the “corresp” attribute links the
edit to the element being normalized. The tokens generated by the
segmentation are encoded in &lt;seg&gt; elements (defined to mark segments) as descendants of the
element representing the edit that generated them. Each word will be encoded by
the &lt;w&gt; element and the POS category will be encoded in the “pos” attribute. This
corresponds to the morphological annotation layer, and the “pos” value will be
assigned after the POS tagger runs.
      </p>
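      <p>The elements just described can be combined as in the following minimal sketch for “dehum”. It is an illustration built from the prose, not a reproduction of Figure 1: apart from “s9_2”, every “xml:id” value is hypothetical, as are the “P” tag for “de” and the choice of placing the modernized form “um” inside the second &lt;w&gt; element.</p>
      <preformat preformat-type="code">&lt;!-- hypothetical ids throughout, except s9_2 --&gt;
&lt;ab xml:id="ab9"&gt;
  &lt;!-- token as it comes from the transcription step --&gt;
  &lt;orig xml:id="o9"&gt;dehum&lt;/orig&gt;
  &lt;!-- segmentation edit; its segments are its descendants --&gt;
  &lt;reg type="seg" corresp="#o9"&gt;
    &lt;seg xml:id="s9_1"&gt;de&lt;/seg&gt;
    &lt;seg xml:id="s9_2"&gt;hum&lt;/seg&gt;
  &lt;/reg&gt;
  &lt;!-- modernization edit, linked to the second segment only --&gt;
  &lt;reg type="mod" corresp="#s9_2"&gt;um&lt;/reg&gt;
  &lt;!-- morphological layer; pos values are filled in after tagging --&gt;
  &lt;w pos="P"&gt;de&lt;/w&gt;
  &lt;w pos="D-UM"&gt;um&lt;/w&gt;
&lt;/ab&gt;</preformat>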
      <p>We encode &lt;w&gt; with the “sameAs” attribute, which points to an element whose
content is the same as that of the current element. It is useful for representing the fact that one
element of a text is identical to another. Alternatively, the &lt;w&gt; elements (lines 18 and
19 of Figure 1) could be removed from the &lt;ab&gt; block and encoded by the stand-off
method, in a single block that brings together all the words with their POS category
annotation, elsewhere in the same file or even in another file. The proposed
encoding scheme adequately represents the words in two &lt;w&gt; elements and explicitly
marks the link between each edit and the element it corresponds to. In the
modernization of “hum”, encoded in line 17 of Figure 1, the value of the “corresp” attribute
references the segment whose “xml:id” is “s9_2”, which identifies the content
being modernized.</p>
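      <p>The stand-off alternative mentioned above could be sketched as follows, under stated assumptions. The &lt;ab type="morph"&gt; container, every “xml:id” value other than “s9_2”, the “P” tag, and the targets chosen for the identity-pointing attribute (written here in its TEI spelling, sameAs) are all hypothetical; the source does not specify them.</p>
      <preformat preformat-type="code">&lt;!-- editorial layer: no word elements inside the block --&gt;
&lt;ab xml:id="ab9"&gt;
  &lt;orig xml:id="o9"&gt;dehum&lt;/orig&gt;
  &lt;reg type="seg" corresp="#o9"&gt;
    &lt;seg xml:id="s9_1"&gt;de&lt;/seg&gt;
    &lt;seg xml:id="s9_2"&gt;hum&lt;/seg&gt;
  &lt;/reg&gt;
  &lt;!-- corresp names the segment being modernized --&gt;
  &lt;reg xml:id="r9_2" type="mod" corresp="#s9_2"&gt;um&lt;/reg&gt;
&lt;/ab&gt;

&lt;!-- morphological layer kept elsewhere in the file (hypothetical container) --&gt;
&lt;ab type="morph"&gt;
  &lt;!-- each word points back to the element with the same content --&gt;
  &lt;w pos="P" sameAs="#s9_1"&gt;de&lt;/w&gt;
  &lt;w pos="D-UM" sameAs="#r9_2"&gt;um&lt;/w&gt;
&lt;/ab&gt;</preformat>
      <p>Keeping the words in a separate block leaves the editorial layer untouched when the tagger is rerun, at the cost of resolving one extra pointer per word.</p>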
      <p>The new annotation scheme will be implemented in a new version of e-Dictor
(version 2.0) that will be developed within the scope of this work. e-Dictor 2.0 will
replace the current edit annotator (e-Dictor 1.0), which has a desktop
GUI (Graphical User Interface). The new version will be a web-based application and,
like e-Dictor 1.0, will have the POS tagger embedded in its code, calling it after
the editorial interventions made by the user. The link between edits and segments (&lt;seg&gt;
elements) shown in the annotation proposal will be generated according to the user's
actions in the software's GUI. This proposal is also suitable for join edits, which
generate a word from two or more &lt;orig&gt; elements. Changing the annotation does not
impose any additional overhead on the e-Dictor software.</p>
      <p>
        The complete ongoing research also aims to develop a syntactic annotation
scheme aligned with the other layers, plus the annotation of metadata. The resulting scheme
can be applied to any written corpus of the Portuguese language, whether manuscript, oral
transcript, or written. Although focused on historical corpora, the scheme is also
intended for contemporary corpora. The Carolina Corpus (‘Corpus Aberto para
Linguística e Inteligência Artificial’, Open Corpus for Linguistics and Artificial
Intelligence), a large open corpus of Brazilian Portuguese texts released in
March 2022, has already adopted the metadata annotation scheme developed in this
work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Final considerations</title>
      <p>The proposed scheme for morphological analysis and editorial interventions for
Portuguese historical corpora, presented in this paper, solves the inadequate
representation of words and editorial interventions found in the e-Dictor XML encoding. By
replacing the current schema with a TEI-conformant one, it achieves a less idiosyncratic
scheme, based on a widely accepted standard for the annotation of digital texts. Thus,
greater interoperability is achieved, and it becomes more likely that other
researchers will use corpora annotated in this format with the e-Dictor software.
The complete research aims to present to the research community a multilayered
annotation schema for Portuguese historical corpora, with TEI-conformant syntactic
annotation aligned with the other layers, together with software developed for its
adoption, thus contributing to the expansion of research on the Portuguese
language and to moving it out of the low-resource languages scenario.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgment</title>
      <p>We thank FAPESB and CNPq, as this work is linked to thematic projects funded by FAPESB
(APP0007/2016 and APP0014/2016) and CNPq (436209/2018-7); the Graduate Program in
Linguistics (PPGLIN); the Corpus Linguistics Research Laboratory (LAPELINC); the
State University of Southwest Bahia (UESB); and advisors Prof. Dr. Jorge Viana Santos and
Prof. Dr. Cristiane Namiuti.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Zeldes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multilayer Corpus Studies</article-title>
          . 1st edn. Routledge, New York (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Galves</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrade</surname>
            ,
            <given-names>A. L. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faria</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Tycho Brahe parsed corpus of historical Portuguese</article-title>
          . Unicamp, Campinas (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Paixão de Sousa, M. C.:
          <article-title>Sistema de Edições Eletrônicas do Corpus Tycho Brahe</article-title>
          . Unicamp, Campinas (
          <year>2007</year>
          ). Homepage https://www.tycho.iel.unicamp.br/corpus/manual/prep/manual_frameset.html,
          last accessed 2021/10/23.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Paixão de Sousa</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kepler</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faria</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>eDictor: novas perspectivas na codificação e edição de corpora de textos históricos</article-title>
          . In:
          <string-name>
            <surname>Shepherd</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sardinha</surname>
            ,
            <given-names>T.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>M. V.</given-names>
          </string-name>
          <source>Caminhos da Linguística de Corpus</source>
          . Mercado de Letras, Campinas (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Paixão de Sousa, M. C.:
          <article-title>Memória do Texto</article-title>
          . In:
          <source>Revista Texto Digital, n. 2</source>
          . Universidade Federal de Santa Catarina, Santa Catarina (
          <year>2006</year>
          ). Homepage http://www.textodigital.ufsc.br/num02/paixao.htm,
          last accessed 2019/02/23.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. Paixão de Sousa, M. C.:
          <article-title>e-Dictor Homepage</article-title>
          . Universidade de São Paulo (USP), São Paulo (
          <year>2022</year>
          ). Homepage https://edictor.net/projetos-envolvidos/,
          last accessed 2022/03/08.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Namiuti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>O objeto livro: a complexidade da forma e o digital</article-title>
          . In:
          <source>X Congresso Internacional da ABRALIN</source>
          . Universidade Federal Fluminense, Niterói (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>J. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Namiuti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DOViC - Documentos Oitocentistas de Vitória da Conquista</article-title>
          . Universidade Estadual do Sudoeste da Bahia, Vitória da Conquista (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Center for Artificial Intelligence:
          <article-title>Research in the C4AI</article-title>
          . Universidade de São Paulo (USP), São Paulo (
          <year>2021</year>
          ). Homepage http://c4ai.inova.usp.br/research/#NLP2, last accessed 2022/03/02.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. TEI Consortium:
          <article-title>TEI P5: Guidelines for Electronic Text Encoding and Interchange</article-title>
          . Version 4.2.1 (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Center for Artificial Intelligence:
          <article-title>Carolina (Corpus Aberto para Linguística e Inteligência Artificial)</article-title>
          . Universidade de São Paulo (USP), São Paulo (
          <year>2022</year>
          ). Homepage http://sites.usp.br/corpuscarolina, last accessed 2019/02/23.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>