=Paper=
{{Paper
|id=Vol-2314/paper8
|storemode=property
|title=Supporting hermeneutic interpretation of historical documents by computational methods
|pdfUrl=https://ceur-ws.org/Vol-2314/paper8.pdf
|volume=Vol-2314
|authors=Cristina Vertan
|dblpUrl=https://dblp.org/rec/conf/comhum/Vertan18
}}
==Supporting hermeneutic interpretation of historical documents by computational methods==
Cristina Vertan, University of Hamburg, Germany (cristina.vertan@uni-hamburg.de)

Proceedings of the Workshop on Computational Methods in the Humanities 2018 (COMHUM 2018)

Abstract

In this paper we introduce a novel framework for data modelling which allows the implementation of tailored annotation tools for a specific digital humanities project. We illustrate the generic framework model by means of two examples from completely different domains, each treating a different language: the construction of a diachronic corpus for classical Ethiopic texts and the computer-based analysis of originals and translations in three languages of historical documents from the 18th century.

1 Introduction

Digitisation campaigns during the last ten years have made available a considerable number of historical texts. The first digitisation phase concentrated on archiving purposes; thus the annotation focused on layout and editorial information. The TEI standard includes dedicated modules for this purpose. However, the next phase of digital humanities implies the active involvement of computational methods for interpretation and fact discovery within digital historical collections, i.e., active computational support for hermeneutic interpretation.

We argue that the interpretation of historical documents cannot be realised by simple black-box algorithms which rely just on the graphical representation of words, but rather by:

1. considering semantics, which implies a deep annotation of the text at several layers;
2. explicitly annotating vague information;
3. making use of non-crisp reasoning (e.g., fuzzy logic, rough sets).

For any high-level content analysis, deep annotation (manual, semi-automatic, or even automatic) is an unavoidable process. For modern languages there are now established standards and rich tools which ensure an easy annotation process. In this contribution we want to illustrate the challenges and special requirements related to the annotation of historical texts; we argue that in many cases the data model is so complex that tools tailored to the corpus and/or the language still have to be developed.

The annotation of historical texts has to consider the following criteria:

• The text to be annotated may change during the annotation. Several scenarios may lead to this situation:
– The original text is damaged, and only the deep annotation and interpretation of the neighbouring context can provide a possible reconstruction;
– The text is a transliteration from another alphabet. In this case transliterations are rarely standardised (also because the historical language was not standardised, and spelling changes such as the insertion of vowels or the doubling of consonants are subject to the interpretation of the annotator and to the assignment of one or another part of speech);
– The documents are a mixture of several languages and OCR performs poorly.
• The annotation has to be done at several layers: text structure, linguistic, domain-specific. Annotations from different levels may overlap.
• All annotations should record a degree of imprecision, and vague assertions have to be explicitly marked. Otherwise interpretations of uncertain events may be distorted by crisp yes/no decisions. Vagueness and uncertainty may lead to different branches of the same annotation base.
• Original text and transliteration both have to be kept and synchronised.
• Historical texts lack digital resources, and historical language requires more features for annotation than modern language. Thus a fully automatic (linguistic) annotation is in many cases impossible. Manual annotation is time-consuming, so functions allowing a controlled semi-automation of the annotation process are more than desirable.
• The annotation tool has to be user-friendly, as annotators often lack extensive IT skills.

As none of the currently available annotation tools (e.g., Bollmann et al., 2014; de Castilho et al., 2016) fulfils all of the criteria listed above, many projects decide to alter the data model instead, i.e., features of the language, the text, or the domain are not included in the annotation model. This has consequences for the analysis and interpretation process.

In this paper we introduce a novel framework for data modelling which allows for the implementation of tailored annotation tools for specific DH projects. We illustrate the generic framework model by means of two examples from completely different domains, each treating a different language: first, the construction of a diachronic corpus for classical Ethiopic texts (Vertan et al., 2016) and, second, the computer-based analysis of originals and translations in three languages of historical documents from the 18th century (Vertan et al., 2017). We present the generic model, show the derived data model for each of the two examples, and discuss the challenges implied by the development of new software. We also illustrate how interchangeability with other digital resources is ensured.

2 Generic Data Model

One of the main requirements of the annotation process for historical data is the possibility of changing the base text without losing the annotation already performed. This requirement leads to the idea that the characters composing the text to be annotated and the annotation itself should be independent of one another and considered as features of an abstract object.

Our generic model is organised around the following notions:

1. Annotation Information (AI)
2. Graphical Unit (GU)
3. Annotation Unit (AU)
4. Annotation Span (AS)
5. Annotation Level (AL)

An Annotation has two components: an Annotation Tag (e.g., a part of speech) and an optional number of features, recorded as [Attribute, Value] pairs (e.g., [Gender, Masculine], [Number, Plural]).

A Graphical Unit is the smallest unit one can select with one single operation (mouse click or key combination).

An Annotation Unit is any subcomponent of a GU which can hold an annotation. An Annotation Unit can include one or more other annotation units. There are cases in which the AU is identical (from the point of view of its borders) with the GU.

An Annotation Span is an object holding an annotation and containing at least two AUs belonging to two GUs.

Each of these objects can have a label denominating it in the text. For example, a GU is a word in a text, an AU is each letter of the word, and a sentence is modelled as an AS. In this way, operations on the labels of one object (insertions, deletions, or replacements of characters) do not affect the annotation already inserted for the respective object.

Links between AU and/or AS objects are ensured through unique IDs. In this way, the model also enables the annotation of discontinuous elements (e.g., a named entity which does not consist of adjacent tokens).

An Annotation Level is a list of annotation units and annotation spans. The allowed annotation tags belong to a closed list unique for each annotation level. An annotated text contains one or more annotation levels.

One can differentiate two models which have to be defined:

1. the model for the units which have to be annotated, i.e., the graphical units and the annotation units;
2. the annotation model, namely the annotation levels, the annotation information allowed for each level, as well as the annotation spans.

In the next sections we illustrate through two examples how this model works in practice. In the first example, in Section 3, it was implemented for the deep annotation of the classical Ethiopic language. In the second example, in Section 4, we show how this framework is currently used also for the annotation of linguistic and factual vagueness in texts.

3 Annotation of Classical Ethiopic Texts

3.1 Particularities of Classical Ethiopic

Classical Ethiopic (Gǝʿz) belongs to the South Semitic language family. Until the end of the 19th century it was one of the most important written languages of Christian Ethiopia. At the beginning, the rich Christian Ethiopic literature was strongly influenced by translations from Greek and later from Arabic; later texts develop a local indigenous style. The language plays an important role for the European cultural heritage: early Christian texts which are lost, badly preserved, or preserved only in fragments in other languages are transmitted entirely in classical Ethiopic (e.g., the Book of Henoch) (Vertan et al., 2016).

Gǝʿz has its own alphabet, developed from the South Semitic script. It is a syllabic script still used today by several languages of Ethiopia and Eritrea (e.g., Amharic, Tigrinya). A particular feature within the Semitic language family is its left-to-right writing direction. Also, in contrast with most other Semitic languages, it is completely vocalised (i.e., the vowels are always written). This leads to the problem that morpheme boundaries cannot be visualised. Sometimes only the vowel within a syllable represents a part of speech and has to be tokenised and annotated; e.g., in the word ቤቱ፡ 'his house' /be·tu/, the /u/ is a pronominal suffix and the tokenisation is thus bet-u.

3.2 Annotation Challenges

Such annotation can be done only on the transcription level. Annotations at other levels (e.g., text divisions, editorial markup) have to be done on the original script. This implies that original and transcription have to be fully synchronised in the annotation tool.

The transcription of the original script can follow a rule-based approach. In contrast, the transliteration (e.g., doubling a consonant) can be done on the basis of the transcription only manually. In many cases the correct transliteration can be decided only after morphological analysis and disambiguation. Thus the annotation tool has to be robust in the face of changes of the text during the annotation process. This is a very important feature but also an enormous challenge for any annotation tool.

A diachronic language analysis (as required in order to observe the development of classical Ethiopic over the centuries) can be done only if the linguistic analysis is deep; usually changes in the language can be observed first in detail and only then at a macro level. For classical Ethiopic the linguistic POS tagset has 33 elements, each with a number of features.

Given that no training data exist, manual annotation is unavoidable. However, the tool we developed provides a mechanism of controlled automatic annotation which, on the one hand, speeds up the process and, on the other hand, leaves the final decision on disambiguation to the user.

3.3 The Annotation Model

A Graphical Unit (GU) represents a sequence of Gǝʿz characters ending with the Gǝʿz separator ፡. The punctuation mark ። is always considered a GU. Tokens are the smallest annotatable units with a meaning of their own to which a lemma can be assigned. Token objects are composed of several transcription letter objects.

For example, the GU object ወይቤሎ፡ represents also an Annotation Unit and contains the four Gǝʿz letter objects ወ, ይ, ቤ, and ሎ, modelled as AUs. Each of these objects contains the corresponding transcription letter objects, modelled also as AUs, namely:

• ወ contains the transcription letter objects w and a;
• ይ contains the transcription letter objects y and ǝ;
• ቤ contains the transcription letter objects b and e;
• ሎ contains the transcription letter objects l and o.

Throughout the transliteration-tokenisation phase, three token objects (in our model also AUs) are built: wa, yǝbel, and o. Finally, the initial GU object has two labels attached: ወይቤሎ and wa-yǝbel-o. For synchronisation reasons we consider the word separator ፡ a property attached to the Gǝʿz character object ሎ. Each token object records the IDs of the transcription letter objects that it contains.

Morphological annotation objects are attached to one token object. They consist of a tag (e.g., the POS "Common Noun") and a list of attribute-value pairs in which the key is the name of the morphological feature (e.g., number). In this way, the tool is robust with respect to the addition of new morphological features or POS tags.

As the correspondences between the Gǝʿz characters and the transcriptions are unique, the system stores just the labels of the transcription letter objects. All other object labels (token, Gǝʿz character, and GU) are generated dynamically through a given correspondence table and the IDs. In this way the system uses less memory and is less error-prone during the transliteration process. In Figure 1 we present the entire data model, including also the other possible annotation levels. The GeTa tool implementing this model is a client application, written in Java and distributed as open-source software.

Figure 1: Annotation Model for classical Ethiopic

4 Annotation of vagueness and uncertainty in historical texts

The second example discusses the annotation of historical texts from the 18th century, for which we want to mark:

1. uncertain characters or words (not entirely deciphered from the manuscript);
2. uncertain dates, places, and persons, and, if possible, their mapping onto a corresponding knowledge base;
3. vague linguistic expressions;
4. indicators for source quotations;
5. text structure;
6. linguistic annotation.

Accordingly, we define six Annotation Levels. The Graphical Unit is a word in the text, i.e., a string delimited by spaces; punctuation is separated as independent words in a preprocessing step. Annotation Units are words, a single letter, or a group of letters inside one word. Annotation spans are necessary in this case for representing named entities (places, persons, etc.), text structure, or vague linguistic expressions. Especially for vague expressions it is extremely important that the model allows discontinuous elements to be part of the same annotation.

To each Annotation Span or Annotation Unit we attach Annotation Information containing attribute-value pairs related to the degree of uncertainty (a fuzzy value), the type of linguistic vagueness, and the source of a quotation together with the trust value of this source. An example of such an annotation is presented in Figure 2.

Figure 2: Annotation Model in HerCoRe

The aim of such annotations is not to develop an expert system in the classic way, as known from artificial intelligence. Such expert systems assume that the computer reasons and presents its interpretation to the user. We consider that for the interpretation of historical facts such a system is not reliable enough: the background knowledge necessary for producing reliable results is huge and often relies on materials which are not available in digital form. Thus our goal is rather to make the user aware that:

1. there are a number of possible answers to one query, and
2. these possible answers may have different degrees of reliability (i.e., they are not necessarily true).

The interpretation and the final decision are left entirely to the user.

5 Conclusions

The annotation model introduced in Section 2 and exemplified in Sections 3 and 4 is flexible and supports changes of the text to be annotated during the annotation process. Of course, the results of these changes must remain consistent with the annotation; this is the responsibility of the annotator (i.e., if the user completely changes the label of an Annotation Unit, he must ask himself whether the new label still corresponds to the annotation). In the particular examples presented in Sections 3 and 4 we encode the model as JSON objects. This allows us to keep the required storage space small and ensures fast access to the data. However, we provide export to other, in particular XML-based, formats, which ensures interoperability with analysis tools such as ANNIS or Voyant. Further work included the implementation of the generic model for the annotation of inscriptions in Classical Maya.

Acknowledgements

This article presents work performed within two projects. The work in Section 3 was performed within the TraCES project (From Translation to Creation: Changes in the Ethiopic Lexicon and Style from Late Antiquity to the Middle Ages), supported by the European Research Council; this work was carried out together with Alessandro Bausi, Wolfgang Dickhut, Andreas Ellwardt, Susanne Hummel, Vitagrazia Pissani, and Eugenia Sokolinski. The work in Section 4 is currently performed within the project HerCoRe (Hermeneutic and Computer-based Analysis of Reliability, Consistency and Vagueness in historical Texts), funded by the Volkswagen Foundation within the framework "Mixed Methods in Humanities"; work reported in this section was done in collaboration with Walther v. Hahn and Alptug Güney.

References

Bollmann, Marcel, Florian Petran, Stefanie Dipper, and Julia Krasselt (2014). CorA: A web-based annotation tool for historical and other non-standard language data. In Proceedings of the 8th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2014), pages 86–90. URL https://aclweb.org/anthology/W14-0612.

de Castilho, Richard Eckart, Éva Mújdricza-Maydt, Seid Muhie Yimam, Silvana Hartmann, Iryna Gurevych, Anette Frank, and Chris Biemann (2016). A web-based tool for the integrated annotation of semantic and syntactic structures. In Proceedings of the LT4DH workshop at COLING 2016, pages 76–84. URL https://aclweb.org/anthology/W16-4011.

Vertan, Cristina, Andreas Ellwardt, and Susanne Hummel (2016). Ein Mehrebenen-Tagging-Modell für die Annotation altäthiopischer Texte [A multi-level tagging model for the annotation of classical Ethiopic texts]. In Proceedings der DHd-Konferenz 2016. URL http://www.dhd2016.de/abstracts/vortr%C3%A4ge-061.html.

Vertan, Cristina, Walther von Hahn, and Anca Dinu (2017). On the annotation of vague expressions: a case study on Romanian historical texts. In Proceedings of the First Workshop on Language Technology for Digital Humanities in Central and (South-)Eastern Europe, in association with RANLP 2017, pages 24–31. doi:10.26615/978-954-452-049-6_028.
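The generic data model of Section 2 (annotations attached as features to abstract objects linked by stable unique IDs, so that editing a label never touches the annotation itself) can be sketched in a few lines. The following Python sketch is our own illustrative reconstruction of that idea; all class and field names are assumptions, not the actual GeTa implementation:

```python
from dataclasses import dataclass, field
from typing import Optional
import itertools

_ids = itertools.count(1)  # source of unique object IDs

@dataclass
class Annotation:
    """Annotation Information: a tag plus optional [Attribute, Value] features."""
    tag: str                                       # e.g. "Common Noun"
    features: dict = field(default_factory=dict)   # e.g. {"Gender": "Masculine"}

@dataclass
class AnnotationUnit:
    """Any subcomponent of a GU that can hold an annotation; may nest further AUs."""
    label: str                                     # surface string, may be edited later
    uid: int = field(default_factory=lambda: next(_ids))
    annotation: Optional[Annotation] = None
    children: list = field(default_factory=list)   # nested AnnotationUnits

@dataclass
class AnnotationSpan:
    """Holds one annotation over at least two AUs, referenced only by ID."""
    unit_ids: list                                 # IDs may be non-adjacent (discontinuous)
    annotation: Optional[Annotation] = None

@dataclass
class AnnotationLevel:
    """A named level with a closed tagset and its units and spans."""
    name: str
    tagset: frozenset
    units: list = field(default_factory=list)
    spans: list = field(default_factory=list)

    def add(self, obj):
        # enforce the closed tag list that is unique for each level
        if obj.annotation is not None and obj.annotation.tag not in self.tagset:
            raise ValueError(f"tag {obj.annotation.tag!r} not allowed on level {self.name}")
        (self.spans if isinstance(obj, AnnotationSpan) else self.units).append(obj)
```

Because spans refer to units only through IDs, changing a unit's label (e.g., correcting a transliteration) leaves all attached annotations intact, and discontinuous elements need no special treatment.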
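As a rough illustration of the ወይቤሎ፡ example from Section 3.3, the GU can be represented as nested objects in which only the transcription letters carry stored labels and IDs, while token and GU labels are generated on demand. The structure and field names below are our own assumptions, not the actual GeTa data format:

```python
# Hypothetical sketch: each Gǝʿz letter object (an AU) holds its transcription
# letter objects; tokens store only the IDs of their transcription letters.
gu = {
    "geez": [
        {"label": "ወ", "trans": [{"id": 1, "label": "w"}, {"id": 2, "label": "a"}]},
        {"label": "ይ", "trans": [{"id": 3, "label": "y"}, {"id": 4, "label": "ǝ"}]},
        {"label": "ቤ", "trans": [{"id": 5, "label": "b"}, {"id": 6, "label": "e"}]},
        {"label": "ሎ", "trans": [{"id": 7, "label": "l"}, {"id": 8, "label": "o"}],
         "separator": "፡"},  # word separator kept as a property of the last letter
    ],
    # the three tokens built during the transliteration-tokenisation phase
    "tokens": [{"ids": [1, 2]}, {"ids": [3, 4, 5, 6, 7]}, {"ids": [8]}],
}

# lookup table from transcription-letter ID to its stored label
letters = {t["id"]: t["label"] for g in gu["geez"] for t in g["trans"]}

def token_label(token):
    """Generate a token label dynamically from the stored IDs."""
    return "".join(letters[i] for i in token["ids"])

print("-".join(token_label(t) for t in gu["tokens"]))  # prints wa-yǝbel-o
```

Note how the Gǝʿz letter ሎ is split across two tokens (the l belongs to yǝbel, the o is a suffix of its own), which is exactly why tokens reference transcription-letter IDs rather than whole syllables.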
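The JSON encoding of the model mentioned in the conclusions might look roughly as follows for a single vagueness annotation; the field names (fuzzyValue, sourceTrust, etc.) are illustrative assumptions on our part, not the actual HerCoRe schema:

```python
import json

# Sketch of an annotation span encoded as JSON, carrying an explicit fuzzy
# confidence value instead of a crisp yes/no decision.
span = {
    "level": "vagueness",
    "unit_ids": [17, 23],  # discontinuous units may belong to one span
    "annotation": {
        "tag": "UncertainDate",
        "features": {"fuzzyValue": 0.6, "sourceTrust": "low"},
    },
}

encoded = json.dumps(span, ensure_ascii=False)  # compact textual storage
decoded = json.loads(encoded)                   # fast, lossless access
print(decoded["annotation"]["features"]["fuzzyValue"])  # prints 0.6
```

An XML export for tools such as ANNIS or Voyant would serialise the same objects; the JSON form is only the internal storage format.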