<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Supporting hermeneutic interpretation of historical documents by computational methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cristina Vertan</string-name>
          <email>cristina.vertan@uni-hamburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Hamburg, Germany</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>77</fpage>
      <lpage>82</lpage>
      <abstract>
        <p>In this paper we will introduce a novel framework for data modelling which allows the implementation of tailored annotation tools for a specific digital humanities project. We will illustrate the generic framework model by means of two examples from completely different domains, each treating a different language: the construction of a diachronic corpus for classical Ethiopic texts and the computer-based analysis of originals and translations in three languages of historical documents from the 18th century.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Digitisation campaigns during the last ten years
have made available a considerable number of
historical texts. The first digitisation phase
concentrated on archiving purposes; thus the annotation
was focused on layout and editorial information.
The TEI standard includes dedicated modules for
this purpose. However, the next phase of digital
humanities implies active involvement of
computational methods for interpretation and fact
discovery within digital historical collections, i.e., active
computational support for the hermeneutic
interpretation.</p>
      <p>We argue that interpretation of historical
documents cannot be realised by simple black-box
algorithms which rely just on the graphical
representation of words but rather by:
1. Considering semantics, which implies a deep
annotation of text at several layers;
2. Explicitly annotating vague information;
3. Making use of non-crisp reasoning (e.g., fuzzy
logic, rough sets).</p>
      <p>For any high-level content analysis, the deep
annotation (manual, semi-automatic, or even
automatic) is an unavoidable process.</p>
      <p>For modern languages there are now established
standards and rich tools which ensure an easy
annotation process. In this contribution we want to
illustrate the challenges and special requirements
related to the annotation of historical texts; we
argue that in many cases the data model is so
complex that tools tailored to the corpus and/or the
language still have to be developed.</p>
      <p>The annotation of historical texts has to consider
the following criteria:
– The original text is damaged and only the deep
annotation and interpretation of the neighbouring
context can provide a possible reconstruction;
– The text is a transliteration from another
alphabet. In this case transliterations are rarely
standardised, also because the historical language
was not standardised; spelling changes like
the insertion of vowels or the doubling of
consonants are subject to the interpretation of the
annotator and to the assignment of one or another
part of speech;
– The documents are a mixture of several
languages and OCR performs poorly.
• The annotation has to be done at several layers:
text structure, linguistic, domain-specific.
Annotations from different levels may overlap.
• All annotations should consider a degree of
imprecision, and vague assertions have to be
explicitly marked. Otherwise interpretations of
uncertain events may be distorted by crisp yes/no
decisions. Vagueness and uncertainty may lead
to different branches of the same annotation base.
• Original text and transliteration both have to be
kept and synchronised.
• Historical texts lack digital resources. Historical
languages require more features for annotation
than modern ones. Thus a fully automatic
(linguistic) annotation is in many cases impossible.
Manual annotation is time-consuming, so
functions allowing a controlled semi-automation
of the annotation process are more than desirable.
• The annotation tool has to be user-friendly as
annotators often lack extensive IT skills.</p>
      <p>
        As none of the currently available annotation
tools
        <xref ref-type="bibr" rid="ref1 ref3">(e.g., Bollmann et al., 2014; de Castilho et al.,
2016)</xref>
        fulfills all of the criteria listed above, many
projects decide to alter the data model instead, i.e.,
features of the language, the text, or the domain are not
included in the annotation model. This has
consequences for the analysis and interpretation process.
      </p>
      <p>
        In this paper we will introduce a novel
framework for data modelling which allows for the
implementation of tailored annotation tools for
specific DH projects. We will illustrate the generic
framework model by means of two examples from
completely different domains, each treating
a different language: first, the construction of a
diachronic corpus for classical Ethiopic texts
        <xref ref-type="bibr" rid="ref4">(Vertan
et al., 2016)</xref>
        and, second, the computer-based
analysis of originals and translations in three languages
of historical documents from the 18th century
        <xref ref-type="bibr" rid="ref6">(Vertan et al., 2017)</xref>
        . We will present the generic model,
show the derived data model for each of the
two examples, and discuss the challenges
implied by the development of new software.
We will also illustrate how interchangeability with
other digital resources is assured.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Generic Data Model</title>
      <p>One of the main requirements of the annotation
process for historical data is the possibility of
changing the base text without losing the
annotation already performed. This requirement leads to
the idea that the characters composing the text to
be annotated and the annotation itself should be
independent one from another and considered as
features of an abstract object.</p>
      <p>Our generic model is organised around the
following notions:</p>
      <sec id="sec-2-1">
        <p>1. Annotation Information (AI)
2. Graphical Unit (GU)
3. Annotation Unit (AU)
4. Annotation Span (AS)
5. Annotation Level (AL)</p>
      </sec>
      <sec id="sec-2-2">
        <p>Annotation Information has two components: an
Annotation Tag (e.g., part of speech) and an optional
number of features, each recorded as an [Attribute, Value]
pair (e.g., [Gender, Masculine], [Number, Plural]).</p>
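        <p>The two components can be pictured as a small record type. The following is only a minimal sketch, not the implementation of the tools described in this paper; the class name and field names are assumptions chosen for illustration.</p>
        <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class AnnotationInformation:
    """A tag plus optional [Attribute, Value] features (hypothetical class)."""
    tag: str
    features: dict[str, str] = field(default_factory=dict)


# A tag with two features, taken from the examples in the text.
noun = AnnotationInformation(
    "Common Noun", {"Gender": "Masculine", "Number": "Plural"}
)

# Features are optional, so a bare tag is also valid.
bare = AnnotationInformation("Particle")

print(noun.tag, sorted(noun.features.items()))
print(bare.tag, bare.features)
```
</preformat>
        <p>Keeping the features in an open mapping rather than in fixed fields means that a new attribute can be added without changing the record type.</p>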
        <p>A Graphical Unit is the smallest unit one can
select with one single operation (mouse click or
key combination).</p>
        <p>An Annotation Unit is any subcomponent of a
GU which can hold an annotation. An
Annotation Unit can include one or more other annotation
units. There are cases in which the AU is identical
(from the point of view of borders) with the GU.</p>
        <p>An Annotation Span is an object holding an
annotation and containing at least two AUs
belonging to two GUs.</p>
        <p>Each of these objects can have a label
denoting it in the text. For example, a GU is a
word in a text, an AU is each letter of the word, and
a sentence is modelled as an AS. In this way
operations on the label of an object (insertions,
deletions, replacements of characters) do not affect the
annotation already inserted for the respective
object.</p>
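        <p>The separation of label and annotation can be illustrated as follows. This is a sketch under our own assumptions (the class name, the ID counter, and the English toy word are invented for illustration): the label and the annotations are two independent features of one object, so correcting the label leaves the annotations untouched.</p>
        <preformat>
```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # simple ID generator; the ID scheme is an assumption


@dataclass
class AnnotatedObject:
    """Abstract object whose label and annotations are separate features."""
    label: str
    annotations: list = field(default_factory=list)
    uid: int = field(default_factory=lambda: next(_ids))


word = AnnotatedObject("hous")               # a GU whose label has a typo
word.annotations.append(("POS", "Common Noun"))

word.label = "house"                         # correct the base text later

print(word.uid, word.label, word.annotations)
```
</preformat>
        <p>Whether the existing annotation still fits the new label is not checked here; that judgement stays with the annotator.</p>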
        <p>Links between AU and/or AS objects are
ensured through unique IDs. In this way, the model
also enables the annotation of discontinuous
elements (e.g., a named entity whose tokens are
not adjacent).</p>
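        <p>Linking by ID rather than by position is what makes discontinuous elements possible. The sketch below assumes hypothetical class names and an invented English toy sentence; the span simply lists the IDs of its members, which need not be neighbours.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationUnit:
    uid: int
    label: str


@dataclass(frozen=True)
class AnnotationSpan:
    uid: int
    tag: str
    member_ids: tuple[int, ...]  # IDs of the linked AUs (or ASs)


words = ["Peter", "the", "so-called", "Great", "arrived"]
units = [AnnotationUnit(i, w) for i, w in enumerate(words, start=1)]
by_id = {u.uid: u for u in units}

# The name skips units 2 and 3, so the span is discontinuous.
name = AnnotationSpan(uid=100, tag="Person", member_ids=(1, 4))

members = [by_id[i].label for i in name.member_ids]
print(name.tag, members)
```
</preformat>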
        <p>An Annotation Level is a list of annotation units
and annotation spans. The allowed annotation tags
belong to a closed list unique for each annotation
level.</p>
        <p>An annotated text contains one or more
annotation levels.</p>
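        <p>The closed tag list of an Annotation Level lends itself to a simple validation step. The following sketch is an assumption about how such a check could look, with hypothetical names and a tag list shortened to three entries; it is not taken from the tools described here.</p>
        <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class AnnotationLevel:
    name: str
    allowed_tags: frozenset[str]
    entries: list[tuple[int, str]] = field(default_factory=list)  # (ID, tag)

    def annotate(self, object_id: int, tag: str) -> None:
        """Attach a tag to an AU or AS, rejecting tags outside the list."""
        if tag not in self.allowed_tags:
            raise ValueError(f"tag {tag!r} not allowed on level {self.name!r}")
        self.entries.append((object_id, tag))


linguistic = AnnotationLevel(
    "linguistic", frozenset({"Common Noun", "Verb", "Conjunction"})
)
linguistic.annotate(1, "Verb")

try:
    linguistic.annotate(2, "Place")  # belongs to another level's tag list
    rejected = False
except ValueError:
    rejected = True

print(linguistic.entries, rejected)
```
</preformat>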
        <p>Two models have to
be defined:
1. The model for the units which have to be
annotated: the graphical units and the annotation
units;
2. The annotation model, namely the annotation
levels, the annotation information allowed for
each level, as well as the annotation spans.
In the next sections we will illustrate through
two examples how this model works in practice. In
the first example, in Section 3, it was implemented
for the deep annotation of the classical Ethiopic
language. In the second example, in Section 4, we will
show how this framework is currently also used
for the annotation of linguistic and factual
vagueness in texts.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Annotation of Classical Ethiopic Texts</title>
      <sec id="sec-3-1">
        <title>3.1 Particularities of Classical Ethiopic</title>
        <p>
          Classical Ethiopic (Gǝʿz) belongs to the south
Semitic language family. Until the end of the
19th century it was one of the most important
written languages of Christian Ethiopia.
In its early period, the rich Christian Ethiopic
literature was strongly influenced by translations
from Greek and later from Arabic. Later texts
develop a local indigenous style. The language plays
an important role for the European cultural
heritage: early Christian texts that are lost, badly preserved,
or preserved only in fragments in other languages are
transmitted entirely in classical Ethiopic (e.g., the Book
of Enoch)
          <xref ref-type="bibr" rid="ref4">(Vertan et al., 2016)</xref>
          .
        </p>
        <p>Gǝʿz has its own alphabet developed from the
south Semitic script. It is a syllabic script still used
today by several languages of Ethiopia
and Eritrea (e.g., Amharic, Tigrinya). A feature that is
unusual within the Semitic language family is the
left-to-right writing direction. Also in contrast with
most other Semitic languages, it is completely
vocalised (i.e., the vowels are always written). This
also leads to the problem that morpheme
boundaries cannot be visualised. Sometimes only the
vowel within a syllable represents a part of speech
and has to be tokenised and annotated (e.g., in the
word ቤቱ፡ ‘his house’ /be·tu/ the /u/ is a
pronominal suffix and the tokenisation is thus bet-u).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Annotation Challenges</title>
        <p>Such annotation can be done only on the
transcription level. Annotations at other levels (e.g., text
divisions, editorial markup) have to be done on
the original script. This implies that original and
transcription have to be fully synchronised in the
annotation tool.</p>
        <p>The transcription of the original script can
follow a rule-based approach. In contrast, the
transliteration (e.g., doubling a consonant) builds
on the transcription and can only be done manually. In
many cases the correct transliteration can be
decided only after morphological analysis and
disambiguation. Thus the annotation tool has to be
robust in the face of changes of the text during the
annotation process. This is a very important
feature but also an enormous challenge for any
annotation tool.</p>
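        <p>A rule-based transcription can be as simple as a table lookup per character. The sketch below is a toy illustration under our own assumptions: the function name is hypothetical, and the table covers only the five characters that occur in the examples of this paper, whereas a real rule set would cover the full syllabary.</p>
        <preformat>
```python
# Toy rule table: one transcription per character (five entries only).
RULES = {"ወ": "wa", "ይ": "yǝ", "ቤ": "be", "ሎ": "lo", "ቱ": "tu"}
SEPARATOR = "፡"  # the word separator is not transcribed


def transcribe(graphical_unit: str) -> str:
    """Transcribe one GU character by character, skipping the separator."""
    return "".join(RULES[ch] for ch in graphical_unit if ch != SEPARATOR)


house = transcribe("ቤቱ፡")
print(house)
print(transcribe("ወይቤሎ፡"))
```
</preformat>
        <p>The output of such a step contains no morpheme boundaries and no doubled consonants; these are exactly the parts that the text assigns to manual transliteration.</p>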
        <p>A diachronic language analysis (as it is required
in order to see the development over centuries of
classical Ethiopic) can be done only if the
linguistic analysis is deep. Usually changes in the
language can be observed first in detail and then at a
macro level. For classical Ethiopic the linguistic
POS tagset has 33 elements, each with a number
of features.</p>
        <p>Given the fact that no training data exist,
manual annotation is unavoidable. However, the tool
we developed provides a mechanism of controlled
automatic annotation, which on the one hand speeds up
the process and on the other hand leaves the final
decision on disambiguation to the user.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 The Annotation Model</title>
        <p>A Graphical Unit (GU) represents a sequence of
Gǝʿz characters ending with the Gǝʿz separator ፡.
The punctuation mark ። is always considered a
GU. Tokens are the smallest annotatable units with
a meaning of their own to which a lemma can be
assigned. Token objects are composed of several
transcription letter objects.</p>
        <p>For example, the GU object ወይቤሎ፡
also represents an Annotation Unit and contains the four
Gǝʿz letter objects modelled as AUs: ወ, ይ, ቤ, ሎ.
Each of these objects contains the corresponding
transcription letter objects, also modelled as AUs,
namely:
• ወ contains the transcription letter objects w and a
• ይ contains the transcription letter objects y and ǝ
• ቤ contains the transcription letter objects b and e
• ሎ contains the transcription letter objects l and o</p>
        <p>During the transliteration-tokenisation
phase, three token objects (in our model also AUs)
are built: wa, yǝbel, and o.</p>
        <p>Finally, the initial GU object has two labels
attached: ወይቤሎ and wa-yǝbel-o. For
synchronisation reasons we consider the word separator ፡
as a property attached to the Gǝʿz character object
ሎ. Each token object records the IDs of the
transcription letter objects that it contains.</p>
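        <p>The example above can be traced in a few lines. This is a sketch, not the GeTa data structures: the variable names, the integer IDs, and the small lookup table (which covers this one word only) are assumptions. Only the transcription letters carry stored labels; the Gǝʿz label and the token label are both derived from IDs.</p>
        <preformat>
```python
# Transcription letter objects: ID mapped to stored label.
letters = {1: "w", 2: "a", 3: "y", 4: "ǝ", 5: "b", 6: "e", 7: "l", 8: "o"}

# Each Gǝʿz character object lists the IDs of its transcription letters.
gez_chars = [(1, 2), (3, 4), (5, 6), (7, 8)]

# Toy correspondence table for this one word only.
TABLE = {"wa": "ወ", "yǝ": "ይ", "be": "ቤ", "lo": "ሎ"}

# Token objects record the IDs of the transcription letters they contain.
tokens = [(1, 2), (3, 4, 5, 6, 7), (8,)]


def label(ids):
    """Build a label from the stored transcription letters."""
    return "".join(letters[i] for i in ids)


gez_label = "".join(TABLE[label(ids)] for ids in gez_chars)
token_label = "-".join(label(ids) for ids in tokens)

print(gez_label)    # ወይቤሎ
print(token_label)  # wa-yǝbel-o
```
</preformat>
        <p>Note that the last token boundary falls inside the character ሎ: the letter l belongs to the token yǝbel, while o is a token of its own. This is why tokenisation has to operate on transcription letters rather than on Gǝʿz characters.</p>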
        <p>Morphological annotation objects are attached
to one token object. They consist of a tag (e.g., the
POS “Common Noun”) and a list of attribute-value
pairs where the key is the name of the
morphological feature (e.g., number). In this way, the tool is
robust with respect to the addition of new
morphological features or POS tags.</p>
        <p>As the correspondences between the
Gǝʿz characters and the transcriptions are unique, the
system stores just the labels of the
transcription letter objects. All other object labels (token,
Gǝʿz character, and GU) are dynamically
generated through a given correspondence table and
the IDs. In this way the system uses less memory
and is less error-prone during the
transliteration process. In Figure 1 we present the entire data
model, including the other possible annotation
levels. The GeTa tool implementing this model is
a client application, written in Java and distributed
as open-source software.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Annotation of vagueness and uncertainty in historical texts</title>
      <p>The second example discusses the annotation of
historical texts from the 18th century for which we
want to mark:
1. Uncertain characters or words (not entirely
deciphered from the manuscript);
2. Uncertain dates, places, and persons and, if possible,
their mapping onto a corresponding knowledge
base;
3. Vague linguistic expressions;
4. Indicators for source quotations;
5. Text structure;
6. Linguistic annotation.</p>
      <p>We define six Annotation Levels. The
Graphical Unit is a word in the text, i.e., a string
delimited by spaces. Punctuation is separated in a
preprocessing step as independent words.</p>
      <p>Annotation Units are words, single letters, or
groups of letters inside one word. Annotation
Spans are in this case necessary for representing
named entities (places, persons, etc.), text
structure, or vague linguistic expressions. Especially
for vague expressions it is extremely important that
the model allows discontinuous elements to be
part of the same annotation.</p>
      <p>To each Annotation Span or Annotation Unit
we attach Annotation Information containing
Attribute–Value pairs that record the degree of
uncertainty (a fuzzy value), the type of linguistic
vagueness, and the source of a quotation together with the
trust value of this source. An example of such an
annotation is presented in Figure 2.</p>
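      <p>As a rough illustration of such Annotation Information, the sketch below stores the attributes named above in one record. All field names, the value range [0, 1] for the fuzzy and trust values, and the sample values are assumptions made for this sketch; they are not taken from the project data.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VaguenessInfo:
    """Attribute-value pairs for one span or unit (hypothetical fields)."""
    tag: str
    certainty: float        # fuzzy degree, assumed to lie in [0, 1]
    vagueness_type: str
    source: str
    source_trust: float     # trust in the quoted source, assumed in [0, 1]

    def __post_init__(self):
        for value in (self.certainty, self.source_trust):
            if value > 1.0 or 0.0 > value:
                raise ValueError("fuzzy values must lie in [0, 1]")


# Invented sample values, for illustration only.
info = VaguenessInfo(
    tag="Date",
    certainty=0.6,
    vagueness_type="approximation",
    source="an unnamed chronicle",
    source_trust=0.8,
)
print(info)
```
</preformat>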
      <p>The aim of such annotations is not to develop
an expert system in the classic way, as known from
artificial intelligence. Such expert systems assume
that the computer is reasoning and presents its
interpretation to the user. We consider that for the
interpretation of historical facts such a system is not
reliable enough. The background knowledge
necessary for producing reliable results is huge and
often relies on materials which are not available in
digital form. Thus our goal is rather to make the
user aware that:
1. There is a number of possible answers to one
query, and
2. these possible answers may have different
degrees of reliability (i.e., they are not necessarily
true).</p>
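      <p>The two points above suggest a query interface that returns every candidate answer together with its reliability instead of a single crisp result. The following sketch assumes invented candidate values and a hypothetical function name; ordering by reliability is our own choice for the illustration.</p>
      <preformat>
```python
# Invented candidates for one query, each with a degree of reliability.
candidates = [
    {"answer": "1710", "reliability": 0.4},
    {"answer": "1711", "reliability": 0.7},
    {"answer": "1712", "reliability": 0.1},
]


def answers(cands):
    """Return all candidates, most reliable first; none is discarded."""
    return sorted(cands, key=lambda c: c["reliability"], reverse=True)


ranked = answers(candidates)
for c in ranked:
    print(c["answer"], c["reliability"])
```
</preformat>
      <p>The function deliberately does not pick a winner: the least reliable answer is still shown, and the choice remains with the user.</p>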
      <p>The interpretation and the final decision are left
entirely to the user.</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions</title>
      <p>The annotation model introduced in Section 2 and
exemplified in Sections 3 and 4 is flexible and
supports changes to the text being annotated during
the annotation process. Of course the results of
these changes must remain consistent with the
annotation. This is the responsibility of the annotator
(i.e., a user who completely changes the label of an
Annotation Unit must check whether the new
label still corresponds to the annotation). In the
particular examples presented in Sections 3 and 4 we
encode the model as JSON objects. This allows us
to keep the required storage space small and
ensures fast access to the data. However, we provide
export to other, in particular XML-based, formats,
which ensures interoperability with other analysis
tools such as ANNIS or Voyant. Further work
includes the implementation of the generic model for
the annotation of inscriptions of Classical Maya.</p>
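      <p>The combination of a JSON encoding with an XML export can be sketched with the Python standard library. The JSON keys, the XML element and attribute names, and the part-of-speech values below are assumptions for illustration; they reproduce neither the actual storage schema nor the import format of ANNIS or any other tool.</p>
      <preformat>
```python
import json
import xml.etree.ElementTree as ET

# Hypothetical JSON encoding of two token objects.
doc = {
    "units": [
        {"id": 1, "label": "wa", "pos": "Conjunction"},
        {"id": 2, "label": "yǝbel", "pos": "Verb"},
    ]
}
encoded = json.dumps(doc, ensure_ascii=False)

# Export: rebuild the same information as an XML tree.
root = ET.Element("annotation")
for unit in json.loads(encoded)["units"]:
    attributes = {"id": str(unit["id"]), "pos": unit["pos"]}
    element = ET.SubElement(root, "unit", attributes)
    element.text = unit["label"]

xml_text = ET.tostring(root, encoding="unicode")
print(encoded)
print(xml_text)
```
</preformat>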
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>This article presents work performed within two
projects. The work in Section 3 was performed
within the TraCES project (From Translation to
Creation: Changes in the Ethiopic Lexicon and Style
from Late Antiquity to the Middle Ages) supported
by the European Research Council. Work
in this project was carried out together with
Alessandro Bausi, Wolfgang Dickhut, Andreas
Ellwardt, Susanne Hummel, Vitagrazia Pissani,
and Eugenia Sokolinski. The work in Section 4
is currently performed within the project
HerCoRe (Hermeneutic and Computer-based Analysis
of Reliability, Consistency and Vagueness in
historical Texts) funded by the Volkswagen
Foundation within the framework “Mixed Methods in
Humanities”. Work reported in that section was done
in collaboration with Walther v. Hahn and Alptug
Güney.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bollmann</surname>
          </string-name>
          , Marcel, Florian Petran, Stefanie Dipper, and
          <string-name>
            <given-names>Julia</given-names>
            <surname>Krasselt</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>CorA: A web-based annotation tool for historical and other non-standard language data</article-title>
          .
          <source>In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2014)</source>
          , pages
          <fpage>86</fpage>
          -
          <lpage>90</lpage>
          . URL https://aclweb.org/anthology/W14-0612.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          de Castilho, Richard Eckart, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>A web-based tool for the integrated annotation of semantic and syntactic structures</article-title>
          .
          <source>In Proceedings of the LT4DH workshop at COLING</source>
          <year>2016</year>
          , pages
          <fpage>76</fpage>
          -
          <lpage>84</lpage>
          . URL https://aclweb.org/anthology/W16-4011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Vertan</surname>
            ,
            <given-names>Cristina</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Ellwardt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Susanne</given-names>
            <surname>Hummel</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Ein Mehrebenen-Tagging-Modell für die Annotation altäthiopischer Texte</article-title>
          .
          <source>In Proceedings der DHd-Konferenz 2016</source>
          . URL http://www.dhd2016.de/abstracts/vortr%C3%A4ge-061.html.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Vertan</surname>
          </string-name>
          , Cristina, Walther von Hahn, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dinu</surname>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>On the annotation of vague expressions: a case study on Romanian historical texts</article-title>
          .
          <source>In Proceedings of the first Workshop on Language Technology for Digital Humanities in Central and (South-)Eastern Europe, in association with RANLP 2017</source>
          , pages
          <fpage>24</fpage>
          -
          <lpage>31</lpage>
          . doi:10.26615/978-954-452-049-6_028.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>