=Paper= {{Paper |id=Vol-1918/declerck |storemode=property |title=Software Projects for Developing Digital Humanities Resources |pdfUrl=https://ceur-ws.org/Vol-1918/declerck.pdf |volume=Vol-1918 |authors=Thierry Declerck |dblpUrl=https://dblp.org/rec/conf/gldv/Declerck17 }} ==Software Projects for Developing Digital Humanities Resources== https://ceur-ws.org/Vol-1918/declerck.pdf
        Software Projects for developing Digital Humanities Resources

                                       Thierry Declerck
                              DFKI GmbH, Language Technology Lab
                                     Stuhlsatzenhausweg, 3
                                     D-66123 Saarbrücken
                                    declerck@dfki.de


Abstract

In this short paper we report on experiences gained from bachelor and master
theses, and from a series of software projects conducted in cooperation with
the Department of Computational Linguistics of Saarland University. These
theses and software projects dealt with the application of Natural Language
Processing and Semantic Web technologies to the representation and analysis
of folktales. Data, code and results of the software projects have been made
available on various repository management services, such as GitLab, GitHub
or Bitbucket. We think that it will be important to discuss the design of
such openly accessible repositories in order to ensure their re-usability
and further extension across various educational institutions.

1   Introduction

In the past 3-4 years we proposed, in cooperation with the Computational
Linguistics (CL) department of Saarland University, a series of
bachelor/master theses and software projects which dealt with various
aspects of the wider field of folktales, thereby introducing Digital
Humanities (DH) topics to students trained primarily to learn and apply
computational methods of language technology.

Our assessment is that building on software projects to introduce CL
students, and a few students from other departments, to Digital Humanities
topics has been very successful. Some of the projects we conducted have also
gained the interest of a broader public, including press coverage[1] and a
broadcast programme[2]. We think that a main aspect of this success story
lies in the fact that the students had to work together, building teams to
work on modules and meeting to integrate the work done so far.

In all 4 software projects conducted so far, we could observe that the
folktale topic was a driver attracting a larger group of students (they can
choose between different software projects). We describe in the following
sections the types of approaches we followed and the results that the
students generated and made available on various repository management
services, such as GitLab, GitHub or Bitbucket. The idea of using software
projects as a platform followed the work done by two students in their
master and bachelor theses, which were written in the context of their
Research Assistant appointments within a larger national project[3]. We
briefly describe the results of all those endeavours in the following
sections.

[1] http://derstandard.at/2000004368363/Wenn-der-Computer-zum-Maerchenonkel-wird
    or
    http://www.abitur-und-studium.de/Bilder/Jana-Ott-Christian-Eisenreich-und-Christian-Willms-Studenten-von-Thierry-Declerck-haben-ein-Programm-entwickelt-das-Maerchen-vorlesen-kann.aspx
[2] See
    http://kulturellebildung.de/fa/user/Fachbereiche/Literatur_Sprache/Aktuelles/141121_PRESSE_Erzaehlen.pdf
[3] We think that the involvement of students as Research Assistants in
    projects is an important aspect to consider.
[4] D-SPIN was a predecessor of CLARIN-D. See
    https://weblicht.sfs.uni-tuebingen.de/englisch/index.shtml
[5] AMICUS (Automated Motif Discovery in Cultural Heritage and Scientific
    Communication Texts) was a Dutch project dealing partly with the
    annotation of folktales with recurrent motifs. See
    https://ilk.uvt.nl/amicus/

2   Annotations

In the context of a cooperation between the past D-SPIN[4] and AMICUS[5]
projects, a master thesis was written by the student Antonia Scheidel on the
annotation of fairy tales with Propp’s functions[6]. Vladimir Propp “was a
Soviet folklorist and scholar who analysed the basic plot components of
Russian folk tales to identify their simplest irreducible narrative
elements.”[7] Propp called those basic plot elements “functions” and
identified 31 of them, such as “Interdiction”, “Delivery” or “Rescue”. Propp
also introduced circa 150 sub-functions that are specialisations of the 31
top-level functions. Complementary to the functions, Propp identified 7
broad characters, such as “the villain”, “the donor” or “the hero”. The
“morphology of the tale” described by Vladimir Propp was based on a subset
of the so-called Afanasyev collection of Russian folktales[8].

Antonia Scheidel developed a new annotation scheme according to which fairy
tales can be queried for texts, temporal structures, characters, dialogues,
and Propp’s functions[9]. The annotation scheme has been named APftML,
standing for “Augmented Propp fairy tale Mark-up Language”. Antonia
Scheidel’s work is documented in (Declerck and Scheidel, 2010) and (Declerck
et al., 2011). Annotated fairy tale data is important because it gives
automated systems a data set against which their results can be mapped (see,
for example, (Scheidel and Declerck, 2010), describing an information
extraction application in the folktale domain)[10]. If fairy tales are
manually annotated with the annotation scheme, the results of the automatic
processing can be compared with the human annotation.

[6] See (Propp, 1968)
[7] https://en.wikipedia.org/wiki/Vladimir_Propp
[8] See https://en.wikipedia.org/wiki/Alexander_Afanasyev
[9] The annotation scheme can be downloaded at
    http://www.coli.uni-saarland.de/~ascheidel/APftML.xsd
[10] Examples of such annotated data can be downloaded at
    http://www.coli.uni-saarland.de/~ascheidel/APftML.xml

3   Syntactic Analysis and a first Ontology

Based on the annotation framework mentioned in the previous section,
Nikolina Koleva worked for her bachelor thesis on an automated system for
processing fairy tale texts. She considered two tales for her work: “The
Magic Swan Geese”, an English version of the Russian fairy tale
“Gusi-lebedi”, and “Väterchen Frost”, a German version of the Russian fairy
tale “Djed Moros”. She has written a program that analyzes the text
according to linguistic criteria, with the aim of recognizing the (main)
characters in it and storing those in a database. This database is of the
“Ontology” type, on the basis of which logical operations can be performed.
The background is a formal description of what can be found in these fairy
tales, including an ontology about family relations. Thus, the system can
recognize that in the text “the daughter” is the same person as “the sister”
when this is suggested by the context. In this way, recognized characters in
fairy tales are semantically annotated with more general categories, like
“Woman”. We then also know in which contexts (or situations) a specific
family member (for example the “daughter”) is involved (see (Declerck et
al., 2012) and (Koleva et al., 2012) for more details on the results of her
work).

Once we had those resources, i.e., an annotation framework for folktales,
based in the first instance on the mark-up of Proppian functions, and an
ontology framework in which characters playing a role in folktales are
stored as instances of domain-specific classes, the idea was to extend them
to a larger framework supporting DH application scenarios.

[11] See https://en.wikipedia.org/wiki/The_Frog_Prince

4   Approaches to Story Segmentation

In a first software project, which built on top of the two resources
mentioned in the previous sections, a division of work could be established
between the four members of the project team. One task consisted in offering
a meaningful segmentation of the tales. The approach consisted in
automatically segmenting the tales along the lines of the dialogue
structure. This had one motivation: to offer a basis for the integration of
a text-to-speech system supporting the “read aloud” of a tale, in which a
voice is associated with each contributor to the dialogues (and one voice
with the narrator). This application is described in more detail in the next
section.

In this project the students worked mainly on the English version of the
“Froschkönig” tale (The Frog Prince)[11]. Following those new steps, the
initial annotation format has been augmented with detailed dialogue
descriptions. The ontology has also been extended and now includes a
description of dialogues (questions, answers, monologues etc.), including
the encodings of the participants
and the dialogue turns. In the two most recent and currently still running
software projects the students are implementing a strategy for additionally
segmenting a tale by the locations in which events occur. There is an
interesting correlation between the segmentation by dialogues and the one by
locations, as in this kind of narrative the participants in a dialogue often
share a location.

5   Emotion Detection and Text-to-Speech Modules

One student had the task of implementing a program able to detect emotions.
For this, the original annotation scheme has been extended to support the
mark-up of 6 basic emotions (fear, grief, joy, etc.), which are also encoded
in the ontology. The automatic processing of the text (based in this case on
the NLTK package[12]) then marked the emotion detected in a sentence on the
basis of an emotion lexicon. This lexicon was built from annotated examples
that served as a seed, and it was completed by consulting the WordNet[13]
module implemented in NLTK[14].

A major extension of the past work in this software project was that
synthetic voices also play a role. Once a character has been recognized, for
example the princess (in the fairy tale “Frog King”), additional features
are coded (for example age, gender, emotion, etc.). Then a previously
defined synthetic voice is automatically assigned to the character. When the
text is processed by the system, the story can be “told” by the voices. If
no character is detected in a dialogue situation, it is assumed that the
narrator is the speaker and the reader the receiver. A demo can be heard in
the corresponding Bitbucket repository[15]. In this software project we made
use of the “Mary” Text-To-Speech System[16]. The overall results of the
projects are also described in (Eisenreich et al., 2014).

[12] NLTK stands for “Natural Language Toolkit” and is written in Python,
    including a lot of corpus processing and statistical libraries. See
    http://www.nltk.org/
[13] See https://wordnet.princeton.edu/ for more details.
[14] See http://www.nltk.org/howto/wordnet.html for more details.
[15] The data, algorithms and results of the projects are stored in
    https://bitbucket.org/ceisen/apftml2repo. A demo of the TTS application
    is available at:
    https://bitbucket.org/ceisen/apftml2repo/src/cbf4d71de7f96146d17c4c84572ceb9a99cd300f/example%20output/audio_output.mp3?at=master&fileviewer=file-view-default
[16] See http://mary.dfki.de/ for more details.
[17] https://sites.ualberta.ca/~urban/Projects/English/Motif_Index.htm
[18] https://en.wikipedia.org/wiki/Motif_(folkloristics)
[19] See http://www.w3.org/TR/owl-semantics/
[20] See http://www.w3.org/TR/rdf-schema/ for more details.
[21] See https://www.w3.org/RDF/ for more details.

6   Iterative Ontology Developments

We described in sections 4 and 5 how the original ontology has been enriched
with additional features. In a second software project, work was dedicated
to the ontologisation of classical knowledge resources (indexes and
classifications) in the field of folklore. In this software project we
considered two such resources: the “Motif-index of folk-literature”
(Thompson, 1955-1958) and the “Types of International Folktales” (Uther,
2004). The first resource, which we abbreviate as TMI, is available as an
on-line resource[17]. A folktale motif can be defined as a “repeated story
element, e.g., a character, an object, an action, or an event that can be
found in several stories”[18]. In TMI all motifs are organized in a tree
structure, so that each motif has a more abstract class that describes a
span of subordinated motifs. One motif entry consists of a motif-id, a motif
name, an optional motif description, and references to literature where the
motif occurs.

The second resource builds on former work by Antti Aarne (Aarne, 1961) and
Stith Thompson. This classification system was extended by Hans-Jörg Uther
(see (Uther, 2004)), and in the following we use the acronym ATU to refer to
this resource. A folktale type can be described as a main story line that
can be found in several cultures. The parts of this story line can refer to
specific story elements, also known as motifs. A folktale type is therefore
a bigger unit than a motif.

Our approach consisted in extracting classification-relevant information
from those knowledge resources, which are stored in different formats, and
in re-organizing it in two interrelated ontologies, using for this the W3C
standards OWL[19], RDF(S)[20] and RDF[21].

The integrated ontology resulting from the software project, also after
curation done in the context of an internship at DFKI, contains 46,950
motifs for the TMI domain and 2,802 elements for the ATU domain, most of
them interrelated by corresponding properties. Results of this software
project are available in a
GitLab repository: https://gitlab.com/folktaleclassification/.

An application of this new integrated ontology to the classification of
characters in folktales has been presented in (Declerck et al., 2016), and
more recent developments related to this integrated ontology are described
in (Declerck et al., 2017).

7   Conclusion

We reported on specific teaching activities in the field of the
representation and processing of folktales by students (mainly) in the field
of computational linguistics. The specificity of the experiences we are
reporting is that those activities took place in the context of software
projects or internships, and thus with a focus on practical implementation
and development work. We noticed that this kind of team work, and also the
compact work done in the context of an internship, delivers a very large
amount of resources that are potentially very relevant for reuse in other
types of teaching activities. A coordinated action between universities and
other educational institutions toward the organization of such software
projects could also be an idea to discuss and implement. Last but not least,
many of the results presented in this short paper have been submitted to and
accepted at relevant workshops and conferences, thus also bringing the
students closer to this type of academic achievement.

References

Antti Aarne. 1961. The Types of the Folktale: A Classification and
  Bibliography. The Finnish Academy of Science and Letters. Translated and
  Enlarged by S. Thompson. Second Revision (FFC 184).

Thierry Declerck and Antonia Scheidel. 2010. An information extraction
  approach to the semantic annotation of folktales. In Sándor Darányi and
  Piroska Lendvai, editors, First International AMICUS Workshop on Automated
  Motif Discovery in Cultural Heritage and Scientific Communication Texts.
  University of Szeged, Hungary.

Thierry Declerck, Antonia Scheidel, and Piroska Lendvai. 2011. Proppian
  content descriptors in an integrated annotation schema for fairy tales. In
  Language Technology for Cultural Heritage. Selected Papers from the LaTeCH
  Workshop Series, Theory and Applications of Natural Language Processing,
  pages 155–169. Springer.

Thierry Declerck, Nikolina Koleva, and Hans-Ulrich Krieger. 2012.
  Ontology-based incremental annotation of characters in folktales. In
  Kalliopi Zervanou and Antal van den Bosch, editors, Proceedings of the 6th
  Workshop on Language Technology for Cultural Heritage, Social Sciences,
  and Humanities (LaTeCH 2012), pages 30–35, Stroudsburg, PA, USA, April.
  Association for Computational Linguistics (ACL).

Thierry Declerck, Tyler Klement, and Antónia Kostová. 2016. Towards a
  WordNet based classification of actors in folktales. In Verginica Barbu
  Mititelu, Corina Forascu, Christiane Fellbaum, and Piek Vossen, editors,
  Proceedings of the Eighth Global WordNet Conference, January. Global
  WordNet Association (GWA).

Thierry Declerck, Antónia Kostová, and Lisa Schäfer. 2017. Towards a linked
  data access to folktales classified by Thompson's motifs and
  Aarne-Thompson-Uther's types. In Proceedings of Digital Humanities 2017,
  August. ADHO.

Christian Eisenreich, Jana Ott, Tonio Sdorf, Christian Willms, and Thierry
  Declerck. 2014. From tale to speech: Ontology-based emotion and dialogue
  annotation of fairy tales with a TTS output. In Proceedings of ISWC 2014.
  Springer.

Nikolina Koleva, Thierry Declerck, and Hans-Ulrich Krieger. 2012. An
  ontology-based iterative text processing strategy for detecting and
  recognizing characters in folktales. In Jan Christoph Meister, editor,
  Digital Humanities 2012 Conference Abstracts, pages 467–470, Hamburg,
  July. University of Hamburg, Hamburg University Press.

Vladimir Propp. 1968. Morphology of the folktale. Trans., Laurence Scott.
  2nd ed., University of Texas Press.

Antonia Scheidel and Thierry Declerck. 2010. APftML - augmented Proppian
  fairy tale markup language. In Sándor Darányi and Piroska Lendvai,
  editors, First International AMICUS Workshop on Automated Motif Discovery
  in Cultural Heritage and Scientific Communication Texts. University of
  Szeged, Hungary.

Stith Thompson. 1955-1958. Motif-index of folk-literature: A classification
  of narrative elements in folktales, ballads, myths, fables, medieval
  romances, exempla, fabliaux, jest-books, and local legends. Revised and
  enlarged edition, Indiana University Press.

Hans-Jörg Uther. 2004. The Types of International Folktales: A
  Classification and Bibliography. Based on the system of Antti Aarne and
  Stith Thompson. Suomalainen Tiedeakatemia.
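
As an illustration of the dialogue-based segmentation described in Section 4, together with the narrator default mentioned in Section 5, the following minimal Python sketch may help. It is not the students' implementation: the quote pattern, the `Segment` class, the `find_speaker` heuristic (take the first known character named in the reporting clause that follows a quote) and the sample text are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Assumption: direct speech is enclosed in straight double quotes.
QUOTE = re.compile(r'"([^"]*)"')


@dataclass
class Segment:
    kind: str     # "narration" or "speech"
    speaker: str  # "narrator" when no character is detected
    text: str


def find_speaker(clause, characters):
    """Return the known character mentioned earliest in clause, else 'narrator'."""
    lowered = clause.lower()
    hits = [(lowered.find(c.lower()), c) for c in characters if c.lower() in lowered]
    return min(hits)[1] if hits else "narrator"


def segment_by_dialogue(text, characters):
    """Split a tale into narration and speech segments, keeping text order."""
    segments = []
    matches = list(QUOTE.finditer(text))
    pos = 0
    for i, match in enumerate(matches):
        narration = text[pos:match.start()].strip()
        if narration:
            segments.append(Segment("narration", "narrator", narration))
        # Hypothetical heuristic: the speaker is named in the reporting clause
        # that follows the quote, up to the first full stop or the next quote.
        next_start = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clause = text[match.end():next_start].split(".")[0]
        speaker = find_speaker(clause, characters)
        segments.append(Segment("speech", speaker, match.group(1).strip()))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        segments.append(Segment("narration", "narrator", tail))
    return segments


# Invented sample text, loosely in the style of "The Frog Prince".
TALE = ('The princess wept by the well. "What ails you?" asked the frog. '
        '"My golden ball fell in," said the princess. "Who goes there?"')
SEGMENTS = segment_by_dialogue(TALE, ["princess", "frog"])
for segment in SEGMENTS:
    print(f"{segment.speaker:>9} | {segment.kind:<9} | {segment.text}")
```

Each segment could then be passed to a text-to-speech voice selected by its `speaker` field, with the narrator voice used whenever no character is found.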


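The lexicon-based emotion detection described in Section 5 can be sketched in a similar, hedged way. The project used the WordNet module of NLTK to complete a seed lexicon; to keep this example self-contained, a small hand-written synonym table stands in for WordNet. The seed entries, the synonym table and the function names are hypothetical, and only three of the emotion labels named in the paper are used.

```python
from collections import Counter

# Hypothetical seed lexicon, standing in for entries taken from annotated tales.
SEED = {"afraid": "fear", "wept": "grief", "glad": "joy"}

# Hypothetical synonym table; in the project this role was played by WordNet.
SYNONYMS = {
    "afraid": ["frightened", "scared"],
    "wept": ["cried", "sobbed"],
    "glad": ["happy", "delighted"],
}


def expand_lexicon(seed, synonyms):
    """Give every synonym of a seed word the emotion label of that seed word."""
    lexicon = dict(seed)
    for word, emotion in seed.items():
        for synonym in synonyms.get(word, []):
            # Keep an existing label if a synonym is already in the lexicon.
            lexicon.setdefault(synonym, emotion)
    return lexicon


def detect_emotion(sentence, lexicon):
    """Return the most frequent emotion label in the sentence, or None."""
    tokens = [token.strip('.,;:!?"\'').lower() for token in sentence.split()]
    counts = Counter(lexicon[token] for token in tokens if token in lexicon)
    return counts.most_common(1)[0][0] if counts else None


LEXICON = expand_lexicon(SEED, SYNONYMS)
for sentence in ["The princess cried and sobbed.", "The king sat down."]:
    print(sentence, "->", detect_emotion(sentence, LEXICON))
```

A sentence with no lexicon hit is left without an emotion label here; how the original system treated such sentences is not stated in the paper.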


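The motif entries and the tree structure described in Section 6 can be pictured with the following minimal sketch, which also prints a few Turtle statements to suggest how such entries might be expressed with OWL and RDFS terms. It does not reproduce the ontology in the project repository: the `tmi:` namespace URI, the motif identifiers and names, and the rule that a parent is obtained by dropping the last dotted component of an identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Motif:
    motif_id: str
    name: str
    description: Optional[str] = None  # optional, as in the motif entries
    references: List[str] = field(default_factory=list)


def parent_id(motif_id):
    """Assumed rule: drop the last dotted component, e.g. 'X1.2.3' -> 'X1.2'."""
    head, sep, _ = motif_id.rpartition(".")
    return head if sep else None


def local_name(motif_id):
    """Make the identifier safe as a prefixed name by replacing dots."""
    return motif_id.replace(".", "_")


def to_turtle(motifs):
    """Serialise motifs as classes linked to their parents by rdfs:subClassOf."""
    out = [
        "@prefix tmi: <http://example.org/tmi#> .",  # hypothetical namespace
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
    ]
    for motif in motifs:
        label = motif.name.replace("\\", "\\\\").replace('"', '\\"')
        props = ["a owl:Class", f'rdfs:label "{label}"']
        parent = parent_id(motif.motif_id)
        if parent is not None:
            props.append(f"rdfs:subClassOf tmi:{local_name(parent)}")
        out.append(f"tmi:{local_name(motif.motif_id)} " + " ;\n    ".join(props) + " .")
    return "\n".join(out)


# Invented entries; identifiers and names are placeholders, not TMI content.
MOTIFS = [
    Motif("X1", "Illustrative top-level motif"),
    Motif("X1.2", "Illustrative sub-motif", references=["Some collection"]),
]
TURTLE = to_turtle(MOTIFS)
print(TURTLE)
```

Modelling each motif as a class with a `rdfs:subClassOf` link to its parent is one possible reading of "each motif has a more abstract class"; other encodings of the hierarchy are equally conceivable.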