=Paper=
{{Paper
|id=Vol-2314/paper8
|storemode=property
|title=Supporting hermeneutic interpretation of historical documents by computational methods
|pdfUrl=https://ceur-ws.org/Vol-2314/paper8.pdf
|volume=Vol-2314
|authors=Cristina Vertan
|dblpUrl=https://dblp.org/rec/conf/comhum/Vertan18
}}
==Supporting hermeneutic interpretation of historical documents by computational methods==
Cristina Vertan, University of Hamburg, Germany (cristina.vertan@uni-hamburg.de)

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)

Abstract

In this paper we introduce a novel framework for data modelling which allows the implementation of tailored annotation tools for a specific digital humanities project. We illustrate the generic framework model by means of two examples from completely different domains, each treating a different language: the construction of a diachronic corpus for classical Ethiopic texts and the computer-based analysis of originals and translations in three languages of historical documents from the 18th century.

1 Introduction

Digitisation campaigns during the last ten years have made available a considerable number of historical texts. The first digitisation phase concentrated on archiving purposes; thus the annotation focused on layout and editorial information. The TEI standard includes dedicated modules for this purpose. However, the next phase of digital humanities implies the active involvement of computational methods for interpretation and fact discovery within digital historical collections, i.e., active computational support for hermeneutic interpretation.

We argue that the interpretation of historical documents cannot be realised by simple black-box algorithms which rely just on the graphical representation of words, but rather by:

1. considering semantics, which implies a deep annotation of the text at several layers;
2. explicitly annotating vague information;
3. making use of non-crisp reasoning (e.g., fuzzy logic, rough sets).

For any high-level content analysis, deep annotation (manual, semi-automatic, or even automatic) is an unavoidable process. For modern languages there are now established standards and rich tools which ensure an easy annotation process. In this contribution we want to illustrate the challenges and special requirements related to the annotation of historical texts; we argue that in many cases the data model is so complex that tools tailored to the corpus and/or the language still have to be developed.

The annotation of historical texts has to consider the following criteria:

• The text to be annotated may change during the annotation. Several scenarios may lead to this situation:
– The original text is damaged, and only the deep annotation and interpretation of the neighbouring context can provide a possible reconstruction;
– The text is a transliteration from another alphabet. In this case transliterations are rarely standardised (also because the historical language was not standardised, and spelling changes such as the insertion of vowels or the doubling of consonants are subject to the interpretation of the annotator and to the assignment of one or another part of speech);
– The documents are a mixture of several languages and OCR performs poorly.
• The annotation has to be done at several layers: text structure, linguistic, domain-specific. Annotations from different levels may overlap.
• All annotations should record a degree of imprecision, and vague assertions have to be explicitly marked. Otherwise interpretations of uncertain events may be distorted by crisp yes/no decisions. Vagueness and uncertainty may lead to different branches of the same annotation base.
• Original text and transliteration both have to be kept and synchronised.
• Historical texts lack digital resources, and historical language requires more features for annotation than modern language. Thus a fully automatic (linguistic) annotation is in many cases impossible. Manual annotation is time-consuming, so functions allowing a controlled semi-automation of the annotation process are more than desirable.
• The annotation tool has to be user-friendly, as annotators often lack extensive IT skills.

As none of the currently available annotation tools (e.g., Bollmann et al., 2014; de Castilho et al., 2016) fulfils all of the criteria listed above, many projects decide to alter the data model instead, i.e., features of the language, the text, or the domain are not included in the annotation model. This has consequences for the analysis and interpretation process.

In this paper we introduce a novel framework for data modelling which allows for the implementation of tailored annotation tools for specific DH projects. We illustrate the generic framework model by means of two examples from completely different domains, each treating a different language: first, the construction of a diachronic corpus for classical Ethiopic texts (Vertan et al., 2016) and, second, the computer-based analysis of originals and translations in three languages of historical documents from the 18th century (Vertan et al., 2017). We present the generic model, show the derived data model for each of the two examples, and discuss the challenges implied by the development of new software. We also illustrate how interchangeability with other digital resources is ensured.

2 Generic Data Model

One of the main requirements of the annotation process for historical data is the possibility of changing the base text without losing the annotation already performed. This requirement leads to the idea that the characters composing the text to be annotated and the annotation itself should be independent of one another and considered as features of an abstract object.

Our generic model is organised around the following notions:

1. Annotation Information (AI)
2. Graphical Unit (GU)
3. Annotation Unit (AU)
4. Annotation Span (AS)
5. Annotation Level (AL)

An Annotation has two components: an Annotation Tag (e.g., a part of speech) and an optional number of features, recorded as [Attribute, Value] pairs (e.g., [Gender, Masculine], [Number, Plural]).

A Graphical Unit is the smallest unit one can select with one single operation (mouse click or key combination).

An Annotation Unit is any subcomponent of a GU which can hold an annotation. An Annotation Unit can include one or more other annotation units. There are cases in which the AU is identical (from the point of view of its borders) with the GU.

An Annotation Span is an object holding an annotation and containing at least two AUs belonging to two GUs.

Each of these objects can have a label denominating it in the text. For example, a GU is a word in a text, an AU is each letter of the word, and a sentence is modelled as an AS. In this way, operations on the labels of one object (insertions, deletions, or replacements of characters) do not affect the annotation already inserted for the respective object.

Links between AU and/or AS objects are ensured through unique IDs. In this way, the model also enables the annotation of discontinuous elements (e.g., a named entity which does not consist of adjacent tokens).

An Annotation Level is a list of annotation units and annotation spans. The allowed annotation tags belong to a closed list unique for each annotation level. An annotated text contains one or more annotation levels.

One can differentiate two models which have to be defined:

1. the model for the units which have to be annotated, i.e., the graphical units and the annotation units;
2. the annotation model, namely the annotation levels, the annotation information allowed for each level, as well as the annotation spans.

In the next sections we illustrate through two examples how this model works in practice. In the first example, in Section 3, it was implemented for the deep annotation of the classical Ethiopic language. In the second example, in Section 4, we show how this framework is currently used also for the annotation of linguistic and factual vagueness in texts.

3 Annotation of Classical Ethiopic Texts

3.1 Particularities of Classical Ethiopic

Classical Ethiopic (Gǝʿz) belongs to the South Semitic language family. Until the end of the 19th century it was one of the most important written languages of Christian Ethiopia. At the beginning, the rich Christian Ethiopic literature was strongly influenced by translations from Greek and later from Arabic; later texts develop a local indigenous style. The language plays an important role for the European cultural heritage: early Christian texts which are lost, badly preserved, or preserved only in fragments in other languages are transmitted entirely in classical Ethiopic (e.g., the Book of Henoch) (Vertan et al., 2016).

Gǝʿz has its own alphabet, developed from the South Semitic script. It is a syllabic script still used today by several languages of Ethiopia and Eritrea (e.g., Amharic, Tigrinya). A particular feature within the Semitic language family is its left-to-right writing direction. Also, in contrast with most other Semitic languages, it is completely vocalised (i.e., the vowels are always written). This leads to the problem that morpheme boundaries cannot be visualised. Sometimes only the vowel within a syllable represents a part of speech and has to be tokenised and annotated; e.g., in the word ቤቱ፡ 'his house' /be·tu/, the /u/ is a pronominal suffix and the tokenisation is thus bet-u.

3.2 Annotation Challenges

Such annotation can be done only on the transcription level. Annotations at other levels (e.g., text divisions, editorial markup) have to be done on the original script. This implies that original and transcription have to be fully synchronised in the annotation tool.

The transcription of the original script can follow a rule-based approach. In contrast, the transliteration (e.g., doubling a consonant) can be done on the basis of the transcription only manually. In many cases the correct transliteration can be decided only after morphological analysis and disambiguation. Thus the annotation tool has to be robust in the face of changes of the text during the annotation process. This is a very important feature but also an enormous challenge for any annotation tool.

A diachronic language analysis (as required in order to observe the development of classical Ethiopic over the centuries) can be done only if the linguistic analysis is deep; usually changes in the language can be observed first in detail and only then at a macro level. For classical Ethiopic the linguistic POS tagset has 33 elements, each with a number of features.

Given that no training data exist, manual annotation is unavoidable. However, the tool we developed provides a mechanism of controlled automatic annotation which, on the one hand, speeds up the process and, on the other hand, leaves the final decision on disambiguation to the user.

3.3 The Annotation Model

A Graphical Unit (GU) represents a sequence of Gǝʿz characters ending with the Gǝʿz separator ፡. The punctuation mark ። is always considered a GU. Tokens are the smallest annotatable units with a meaning of their own to which a lemma can be assigned. Token objects are composed of several transcription letter objects.

For example, the GU object ወይቤሎ፡ represents also an Annotation Unit and contains the four Gǝʿz letter objects ወ, ይ, ቤ, and ሎ, modelled as AUs. Each of these objects contains the corresponding transcription letter objects, modelled also as AUs, namely:

• ወ contains the transcription letter objects w and a;
• ይ contains the transcription letter objects y and ǝ;
• ቤ contains the transcription letter objects b and e;
• ሎ contains the transcription letter objects l and o.

Throughout the transliteration-tokenisation phase, three token objects (in our model also AUs) are built: wa, yǝbel, and o. Finally, the initial GU object has two labels attached: ወይቤሎ and wa-yǝbel-o. For synchronisation reasons we consider the word separator ፡ a property attached to the Gǝʿz character object ሎ. Each token object records the IDs of the transcription letter objects that it contains.

Morphological annotation objects are attached to one token object. They consist of a tag (e.g., the POS "Common Noun") and a list of attribute-value pairs in which the key is the name of the morphological feature (e.g., number). In this way, the tool is robust with respect to the addition of new morphological features or POS tags.

As the correspondences between the Gǝʿz characters and the transcriptions are unique, the system stores just the labels of the transcription letter objects. All other object labels (token, Gǝʿz character, and GU) are generated dynamically through a given correspondence table and the IDs. In this way the system uses less memory and is less error-prone during the transliteration process. In Figure 1 we present the entire data model, including also the other possible annotation levels. The GeTa tool implementing this model is a client application, written in Java and distributed as open-source software.

Figure 1: Annotation Model for classical Ethiopic

4 Annotation of vagueness and uncertainty in historical texts

The second example discusses the annotation of historical texts from the 18th century, for which we want to mark:

1. uncertain characters or words (not entirely deciphered from the manuscript);
2. uncertain dates, places, and persons, and, if possible, their mapping onto a corresponding knowledge base;
3. vague linguistic expressions;
4. indicators for source quotations;
5. text structure;
6. linguistic annotation.

Accordingly, we define six Annotation Levels. The Graphical Unit is a word in the text, i.e., a string delimited by spaces; punctuation is separated as independent words in a preprocessing step. Annotation Units are words, a single letter, or a group of letters inside one word. Annotation spans are necessary in this case for representing named entities (places, persons, etc.), text structure, or vague linguistic expressions. Especially for vague expressions it is extremely important that the model allows discontinuous elements to be part of the same annotation.

To each Annotation Span or Annotation Unit we attach Annotation Information containing attribute-value pairs related to the degree of uncertainty (a fuzzy value), the type of linguistic vagueness, and the source of a quotation together with the trust value of this source. An example of such an annotation is presented in Figure 2.

Figure 2: Annotation Model in HerCoRe

The aim of such annotations is not to develop an expert system in the classic way, as known from artificial intelligence. Such expert systems assume that the computer reasons and presents its interpretation to the user. We consider that for the interpretation of historical facts such a system is not reliable enough: the background knowledge necessary for producing reliable results is huge and often relies on materials which are not available in digital form. Thus our goal is rather to make the user aware that:

1. there are a number of possible answers to one query, and
2. these possible answers may have different degrees of reliability (i.e., they are not necessarily true).

The interpretation and the final decision are left entirely to the user.

5 Conclusions

The annotation model introduced in Section 2 and exemplified in Sections 3 and 4 is flexible and supports changes of the text to be annotated during the annotation process. Of course, the results of these changes must remain consistent with the annotation; this is the responsibility of the annotator (i.e., if the user completely changes the label of an Annotation Unit, he must ask himself whether the new label still corresponds to the annotation). In the particular examples presented in Sections 3 and 4 we encode the model as JSON objects. This allows us to keep the required storage space small and ensures fast access to the data. However, we provide export to other, in particular XML-based, formats, which ensures interoperability with analysis tools such as ANNIS or Voyant. Further work included the implementation of the generic model for the annotation of inscriptions in Classical Maya.

Acknowledgements

This article presents work performed within two projects. The work in Section 3 was performed within the TraCES project (From Translation to Creation: Changes in the Ethiopic Lexicon and Style from Late Antiquity to the Middle Ages), supported by the European Research Council; this work was carried out together with Alessandro Bausi, Wolfgang Dickhut, Andreas Ellwardt, Susanne Hummel, Vitagrazia Pissani, and Eugenia Sokolinski. The work in Section 4 is currently performed within the project HerCoRe (Hermeneutic and Computer-based Analysis of Reliability, Consistency and Vagueness in historical Texts), funded by the Volkswagen Foundation within the framework "Mixed Methods in Humanities"; work reported in this section was done in collaboration with Walther v. Hahn and Alptug Güney.

References

Bollmann, Marcel, Florian Petran, Stefanie Dipper, and Julia Krasselt (2014). CorA: A web-based annotation tool for historical and other non-standard language data. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2014), pages 86–90. URL https://aclweb.org/anthology/W14-0612.

de Castilho, Richard Eckart, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann (2016). A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the LT4DH workshop at COLING 2016, pages 76–84. URL https://aclweb.org/anthology/W16-4011.

Vertan, Cristina, Andreas Ellwardt, and Susanne Hummel (2016). Ein Mehrebenen-Tagging-Modell für die Annotation altäthiopischer Texte [A multi-level tagging model for the annotation of classical Ethiopic texts]. In Proceedings der DHd-Konferenz 2016. URL http://www.dhd2016.de/abstracts/vortr%C3%A4ge-061.html.

Vertan, Cristina, Walther von Hahn, and Anca Dinu (2017). On the annotation of vague expressions: a case study on Romanian historical texts. In Proceedings of the First Workshop on Language Technology for Digital Humanities in Central and (South-)Eastern Europe, in association with RANLP 2017, pages 24–31. doi:10.26615/978-954-452-049-6_028.
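The generic data model of Section 2 (annotations attached as features to abstract objects linked by stable unique IDs, so that editing a label never touches the annotation itself) can be sketched in a few lines. The following Python sketch is our own illustrative reconstruction of that idea; all class and field names are assumptions, not the actual GeTa implementation:

```python
from dataclasses import dataclass, field
from typing import Optional
import itertools

_ids = itertools.count(1)  # source of unique object IDs

@dataclass
class Annotation:
    """Annotation Information: a tag plus optional [Attribute, Value] features."""
    tag: str                                       # e.g. "Common Noun"
    features: dict = field(default_factory=dict)   # e.g. {"Gender": "Masculine"}

@dataclass
class AnnotationUnit:
    """Any subcomponent of a GU that can hold an annotation; may nest further AUs."""
    label: str                                     # surface string, may be edited later
    uid: int = field(default_factory=lambda: next(_ids))
    annotation: Optional[Annotation] = None
    children: list = field(default_factory=list)   # nested AnnotationUnits

@dataclass
class AnnotationSpan:
    """Holds one annotation over at least two AUs, referenced only by ID."""
    unit_ids: list                                 # IDs may be non-adjacent (discontinuous)
    annotation: Optional[Annotation] = None

@dataclass
class AnnotationLevel:
    """A named level with a closed tagset and its units and spans."""
    name: str
    tagset: frozenset
    units: list = field(default_factory=list)
    spans: list = field(default_factory=list)

    def add(self, obj):
        # enforce the closed tag list that is unique for each level
        if obj.annotation is not None and obj.annotation.tag not in self.tagset:
            raise ValueError(f"tag {obj.annotation.tag!r} not allowed on level {self.name}")
        (self.spans if isinstance(obj, AnnotationSpan) else self.units).append(obj)
```

Because spans refer to units only through IDs, changing a unit's label (e.g., correcting a transliteration) leaves all attached annotations intact, and discontinuous elements need no special treatment.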
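As a rough illustration of the ወይቤሎ፡ example from Section 3.3, the GU can be represented as nested objects in which only the transcription letters carry stored labels and IDs, while token and GU labels are generated on demand. The structure and field names below are our own assumptions, not the actual GeTa data format:

```python
# Hypothetical sketch: each Gǝʿz letter object (an AU) holds its transcription
# letter objects; tokens store only the IDs of their transcription letters.
gu = {
    "geez": [
        {"label": "ወ", "trans": [{"id": 1, "label": "w"}, {"id": 2, "label": "a"}]},
        {"label": "ይ", "trans": [{"id": 3, "label": "y"}, {"id": 4, "label": "ǝ"}]},
        {"label": "ቤ", "trans": [{"id": 5, "label": "b"}, {"id": 6, "label": "e"}]},
        {"label": "ሎ", "trans": [{"id": 7, "label": "l"}, {"id": 8, "label": "o"}],
         "separator": "፡"},  # word separator kept as a property of the last letter
    ],
    # the three tokens built during the transliteration-tokenisation phase
    "tokens": [{"ids": [1, 2]}, {"ids": [3, 4, 5, 6, 7]}, {"ids": [8]}],
}

# lookup table from transcription-letter ID to its stored label
letters = {t["id"]: t["label"] for g in gu["geez"] for t in g["trans"]}

def token_label(token):
    """Generate a token label dynamically from the stored IDs."""
    return "".join(letters[i] for i in token["ids"])

print("-".join(token_label(t) for t in gu["tokens"]))  # prints wa-yǝbel-o
```

Note how the Gǝʿz letter ሎ is split across two tokens (the l belongs to yǝbel, the o is a suffix of its own), which is exactly why tokens reference transcription-letter IDs rather than whole syllables.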
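The JSON encoding of the model mentioned in the conclusions might look roughly as follows for a single vagueness annotation; the field names (fuzzyValue, sourceTrust, etc.) are illustrative assumptions on our part, not the actual HerCoRe schema:

```python
import json

# Sketch of an annotation span encoded as JSON, carrying an explicit fuzzy
# confidence value instead of a crisp yes/no decision.
span = {
    "level": "vagueness",
    "unit_ids": [17, 23],  # discontinuous units may belong to one span
    "annotation": {
        "tag": "UncertainDate",
        "features": {"fuzzyValue": 0.6, "sourceTrust": "low"},
    },
}

encoded = json.dumps(span, ensure_ascii=False)  # compact textual storage
decoded = json.loads(encoded)                   # fast, lossless access
print(decoded["annotation"]["features"]["fuzzyValue"])  # prints 0.6
```

An XML export for tools such as ANNIS or Voyant would serialise the same objects; the JSON form is only the internal storage format.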