=Paper=
{{Paper
|id=Vol-1918/declerck
|storemode=property
|title=Software Projects for Developing Digital Humanities Resources
|pdfUrl=https://ceur-ws.org/Vol-1918/declerck.pdf
|volume=Vol-1918
|authors=Thierry Declerck
|dblpUrl=https://dblp.org/rec/conf/gldv/Declerck17
}}
==Software Projects for Developing Digital Humanities Resources==
Thierry Declerck, DFKI GmbH, Language Technology Lab, Stuhlsatzenhausweg 3, D-66123 Saarbrücken, declerck@dfki.de

===Abstract===
In this short paper we report on experience gained from bachelor and master theses and from a series of software projects conducted in cooperation with the Department of Computational Linguistics of Saarland University. These theses and software projects dealt with the application of Natural Language Processing and Semantic Web technologies to the representation and analysis of folktales. Data, code and results of the software projects have been made available on various repository management services, such as GitLab, GitHub or Bitbucket. We think that it will be important to discuss the design of such openly accessible repositories in order to ensure their re-usability and further extension across various educational institutions.

===1 Introduction===
Over the past 3-4 years we proposed, in cooperation with the Computational Linguistics (CL) department of Saarland University, a series of bachelor/master theses and software projects dealing with various aspects of the wider field of folktales, thereby introducing Digital Humanities (DH) topics to students trained primarily to learn and apply computational methods of language technology.

Our assessment is that the approach of building on software projects to introduce CL students, and a few students from other departments, to Digital Humanities topics has been very successful. Some of the projects we conducted have also gained the interest of a broader public, including press coverage [1] and a broadcast programme [2]. We think that a main aspect of this success story lies in the fact that the students had to work together, building teams for working on modules and meeting to integrate the work done so far.

In all 4 software projects conducted so far, we could observe that the folktale topic was a driver attracting a larger group of students (they can choose between different software projects). We describe in the following sections the types of approaches we followed and the results that the students generated and made available on various repository management services, such as GitLab, GitHub or Bitbucket. The idea of using software projects as a platform followed the work done by two students in their master and bachelor theses, which were written in the context of their Research Assistant appointments within a larger national project [3]. We briefly describe the results of all those endeavours in the following sections.

===2 Annotations===
In the context of cooperation between the past D-SPIN [4] and AMICUS [5] projects, a master thesis was written by the student Antonia Scheidel on the annotation of fairy tales with Propp’s functions [6]. Vladimir Propp “was a Soviet folklorist and scholar who analysed the basic plot components of Russian folk tales to identify their simplest irreducible narrative elements.” [7] Propp called those basic plot elements “functions” and identified 31 of them, such as “Interdiction”, “Delivery” or “Rescue”. Propp also introduced circa 150 sub-functions that are specialisations of the 31 top-level functions. Complementary to the functions, Propp identified 7 broad characters, such as “the villain”, “the donor” or “the hero”. The “morphology of the tale” described by Vladimir Propp was based on a subset of the so-called Afanasyev collection of Russian folktales [8].

Antonia Scheidel developed a new annotation scheme according to which fairy tales can be queried for texts, temporal structures, characters, dialogues, and Propp’s functions [9]. The annotation scheme has been named APftML, standing for “Augmented Propp fairy tale Mark-up Language”. Antonia Scheidel’s work is documented in (Declerck and Scheidel, 2010) and (Declerck et al., 2011). Annotated fairy tale textual data is important in that automated systems have a data set against which they can map their results (see, for example, (Scheidel and Declerck, 2010), describing an information extraction application in the folktale domain) [10]. If fairy tales are manually annotated with the annotation scheme, the results of the automatic processing can be compared with the human annotation.

===3 Syntactic Analysis and a first Ontology===
Based on the annotation framework mentioned in the previous section, Nikolina Koleva worked for her bachelor thesis on an automated system for processing fairy tale texts. She considered two tales for her work: “The Magic Swan Geese”, an English version of the Russian fairy tale “Gusi-lebedi”, and “Väterchen Frost”, a German version of the Russian fairy tale “Djed Moros”. She has written a program that analyzes the text according to linguistic criteria, with the aim of recognizing the (main) characters in it and storing them in a database. This database is of the “ontology” type, on the basis of which logical operations can be performed. The background is a formal description of what can be found in these fairy tales, including an ontology of family relations. Thus, the system can recognize that in the text “the daughter” is the same person as “the sister” when this is suggested by the context. In this way, recognized characters in fairy tales are semantically annotated with more general categories, such as “Woman”. We then also know in which contexts (or situations) a specific family member (for example the “daughter”) is involved (see (Declerck et al., 2012) and (Koleva et al., 2012) for more details on the results of her work).

Once we had those resources, i.e. an annotation framework for folktales, based in the first instance on the mark-up of Proppian functions, and an ontology framework in which characters playing a role in folktales are stored as instances of domain-specific classes, the idea was to extend them to a larger framework supporting DH application scenarios.

===4 Approaches to Story Segmentation===
In a first software project, which built on the two resources mentioned in the previous sections, the work could be divided between the four members of the project team. One task consisted in offering a meaningful segmentation of the tales. The approach consisted in automatically segmenting the tales along the lines of the dialogue structure. This had one motivation: to offer a basis for the integration of a text-to-speech system supporting the “read aloud” of a tale, in which a voice is associated with each contributor to the dialogues (and one voice with the narrator). This application is described in more detail in the next section.

The students worked in this project mainly on the English version of the “Froschkönig” tale (The Frog Prince) [11]. Following those new steps, the initial annotation format has been augmented with detailed dialogue descriptions. The ontology has also been extended and now includes a description of dialogues (questions, answers, monologues, etc.), including the encoding of the participants and the dialogue turns. In the two most recent, currently still running software projects the students are implementing a strategy for additionally segmenting a tale by the locations in which events occur. There is an interesting correlation between the segmentation by dialogues and the one by locations, as in this kind of narrative the participants in a dialogue often share a location.

===5 Emotions Detection and Text-to-Speech Modules===
One student had the task of implementing a program able to detect emotions. For this the original annotation scheme has been extended to support the mark-up of 6 basic emotions (fear, grief, joy, etc.), which are also encoded in the ontology. The automatic processing of the text (based in this case on the NLTK package [12]) then marked the emotion detected in a sentence on the basis of an emotion lexicon built from annotated examples that served as a seed, which was completed by consulting the WordNet [13] module implemented in NLTK [14].

A major extension of the past work in this software project was that synthetic voices also play a role. Once a character has been recognized, for example the princess (in the fairy tale “Frog King”), additional features are coded (for example age, gender, emotion, etc.). Then a previously defined synthetic voice is automatically assigned to the character. When the text is processed by the system, the story can be “told” by the voices. If no character is detected in a dialogue situation, it is assumed that the narrator is the speaker and the reader the receiver. A demo can be heard in the corresponding Bitbucket repository [15]. In this software project we made use of the “Mary” Text-To-Speech System [16]. The overall results of the projects are also described in (Eisenreich et al., 2014).

===6 Iterative Ontology Developments===
We described in sections 4 and 5 how the original ontology has been enriched with additional features. In a second software project, work was dedicated to the ontologisation of classical knowledge resources (indexation and classification) in the field of folklore. We considered two such resources in this software project: the “Motif-index of folk-literature” (Thompson, 1955-1958) and the “Types of International Folktales” (Uther, 2004).

The first resource, which we abbreviate as TMI, is available as an on-line resource [17]. A folktale motif can be defined as a “repeated story element, e.g., a character, an object, an action, or an event that can be found in several stories” [18]. In TMI all motifs are organized in a tree structure, so that each motif has a more abstract class that describes a span of subordinated motifs. One motif entry consists of a motif-id, motif name, motif description (optional), and references to literature where it occurs.

The second resource builds on former work by Antti Aarne (Aarne, 1961) and Stith Thompson. This classification system was extended by Hans-Jörg Uther (see (Uther, 2004)), and in the following we use the acronym ATU to refer to this resource. A folktale type can be described as a main story line that can be found in several cultures. The parts of this story line can refer to specific story elements also known as motifs. A folktale type is therefore a bigger unit than a motif.

Our approach consisted in extracting classification-relevant information from those knowledge resources, which are stored in different formats, and re-organizing it in two interrelated ontologies, using for this the W3C standards OWL [19], RDF(S) [20] and RDF [21].

The integrated ontology resulting from the software project, also after curation done in the context of an internship at DFKI, contains 46,950 motifs for the TMI domain and 2802 elements for the ATU domain, most of them interrelated by corresponding properties. Results of this software project are available in a GitLab repository: https://gitlab.com/folktaleclassification/.

An application of this new integrated ontology to the classification of characters in folktales has been presented in (Declerck et al., 2016), and more recent developments related to this integrated ontology are described in (Declerck et al., 2017).

===7 Conclusion===
We reported on specific teaching activities in the field of the representation and processing of folktales by students (mainly) in the field of computational linguistics. The specificity of the experiences we are reporting is that those activities took place in the context of software projects or internships, thus with a focus on practical implementation and development work. We noticed that this kind of team work, or also compact work done in the context of an internship, delivers a very large amount of resources that are potentially very relevant for reuse in other types of teaching activities. A coordinated action between universities and other educational institutions toward the organization of such software projects could also be an idea to discuss and implement. Last but not least, many of the results presented in this short paper have been submitted to and accepted at relevant workshops and conferences, thus also bringing the students closer to this type of academic achievement.

===Notes===
[1] http://derstandard.at/2000004368363/Wenn-der-Computer-zum-Maerchenonkel-wird or http://www.abitur-und-studium.de/Bilder/Jana-Ott-Christian-Eisenreich-und-Christian-Willms-Studenten-von-Thierry-Declerck-haben-ein-Programm-entwickelt-das-Maerchen-vorlesen-kann.aspx

[2] See http://kulturellebildung.de/fa/user/Fachbereiche/Literatur_Sprache/Aktuelles/141121_PRESSE_Erzaehlen.pdf

[3] We do think that the involvement of students as Research Assistants in projects is an important aspect to be considered.

[4] D-SPIN was a predecessor of CLARIN-D. See https://weblicht.sfs.uni-tuebingen.de/englisch/index.shtml

[5] AMICUS (Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts) was a Dutch project dealing partly with the annotation of folktales with recurrent motifs. See https://ilk.uvt.nl/amicus/

[6] See (Propp, 1968).

[7] https://en.wikipedia.org/wiki/Vladimir_Propp

[8] See https://en.wikipedia.org/wiki/Alexander_Afanasyev

[9] The annotation scheme can be downloaded at http://www.coli.uni-saarland.de/~ascheidel/APftML.xsd

[10] Examples of such annotated data can be downloaded at http://www.coli.uni-saarland.de/~ascheidel/APftML.xml

[11] See https://en.wikipedia.org/wiki/The_Frog_Prince

[12] NLTK stands for “Natural Language Toolkit” and is written in Python, including a lot of corpus processing and statistical libraries. See http://www.nltk.org/

[13] See https://wordnet.princeton.edu/ for more details.

[14] See http://www.nltk.org/howto/wordnet.html for more details.

[15] The data, algorithms and results of the projects are stored in https://bitbucket.org/ceisen/apftml2repo. A demo of the TTS application is available at: https://bitbucket.org/ceisen/apftml2repo/src/cbf4d71de7f96146d17c4c84572ceb9a99cd300f/example%20output/audio_output.mp3?at=master&fileviewer=file-view-default

[16] See http://mary.dfki.de/ for more details.

[17] https://sites.ualberta.ca/~urban/Projects/English/Motif_Index.htm

[18] https://en.wikipedia.org/wiki/Motif_(folkloristics)

[19] See http://www.w3.org/TR/owl-semantics/

[20] See http://www.w3.org/TR/rdf-schema/ for more details.

[21] See https://www.w3.org/RDF/ for more details.

===References===
Antti Aarne. 1961. The Types of the Folktale: A Classification and Bibliography. The Finnish Academy of Science and Letters. Translated and Enlarged by S. Thompson. Second Revision (FFC 184).

Thierry Declerck and Antonia Scheidel. 2010. An information extraction approach to the semantic annotation of folktales. In Sándor Darányi and Piroska Lendvai, editors, First International AMICUS Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts. University of Szeged, Hungary.

Thierry Declerck, Antonia Scheidel, and Piroska Lendvai. 2011. Proppian content descriptors in an integrated annotation schema for fairy tales. In Language Technology for Cultural Heritage. Selected Papers from the LaTeCH Workshop Series, Theory and Applications of Natural Language Processing, pages 155–169. Springer.

Thierry Declerck, Nikolina Koleva, and Hans-Ulrich Krieger. 2012. Ontology-based incremental annotation of characters in folktales. In Kalliopi Zervanou and Antal van den Bosch, editors, Proceedings of the 6th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2012), pages 30–35. Association for Computational Linguistics (ACL).

Thierry Declerck, Tyler Klement, and Antónia Kostová. 2016. Towards a WordNet based classification of actors in folktales. In Verginica Barbu Mititelu, Corina Forascu, Christiane Fellbaum, and Piek Vossen, editors, Proceedings of the Eighth Global WordNet Conference. Global WordNet Association.

Thierry Declerck, Antónia Kostová, and Lisa Schäfer. 2017. Towards a linked data access to folktales classified by Thompson’s motifs and Aarne-Thompson-Uther’s types. In Proceedings of Digital Humanities 2017. ADHO.

Christian Eisenreich, Jana Ott, Tonio Sdorf, Christian Willms, and Thierry Declerck. 2014. From tale to speech: Ontology-based emotion and dialogue annotation of fairy tales with a TTS output. In Proceedings of ISWC 2014. Springer.

Nikolina Koleva, Thierry Declerck, and Hans-Ulrich Krieger. 2012. An ontology-based iterative text processing strategy for detecting and recognizing characters in folktales. In Jan Christoph Meister, editor, Digital Humanities 2012 Conference Abstracts, pages 467–470, Hamburg. University of Hamburg, Hamburg University Press.

Vladimir Propp. 1968. Morphology of the folktale. Trans., Laurence Scott. 2nd ed., University of Texas Press.

Antonia Scheidel and Thierry Declerck. 2010. APftML - augmented Proppian fairy tale markup language. In Sándor Darányi and Piroska Lendvai, editors, First International AMICUS Workshop on Automated Motif Discovery in Cultural Heritage and Scientific Communication Texts. Szeged University.

Stith Thompson. 1955-1958. Motif-index of folk-literature: A classification of narrative elements in folktales, ballads, myths, fables, medieval romances, exempla, fabliaux, jest-books, and local legends. Revised and enlarged edition, Indiana University Press.

Hans-Jörg Uther. 2004. The Types of International Folktales: A Classification and Bibliography. Based on the system of Antti Aarne and Stith Thompson. Suomalainen Tiedeakatemia.