<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sentiment Annotation of Historic German Plays: An Empirical Study on Annotation Behavior</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Schmidt</string-name>
          <email>thomas.schmidt@ur.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Burghardt</string-name>
          <email>burghardt@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katrin Dennerlein</string-name>
          <email>katrin.dennerlein@uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Humanities Group, Leipzig University</institution>
          ,
          <addr-line>04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of German Philology, Würzburg University</institution>
          ,
          <addr-line>97074 Würzburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Media Informatics Group, Regensburg University</institution>
          ,
          <addr-line>93040 Regensburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>47</fpage>
      <lpage>52</lpage>
      <abstract>
        <p>We present results of a sentiment annotation study in the context of historical German plays. Our annotation corpus consists of 200 representative speeches from the German playwright Gotthold Ephraim Lessing. Six annotators, five non-experts and one expert in the domain, annotated the speeches according to different sentiment annotation schemes. They had to annotate the differentiated polarity (very negative, negative, neutral, mixed, positive, very positive), the binary polarity (positive/negative) and the occurrence of eight basic emotions. After the annotation, the participants completed a questionnaire about their experience of the annotation process; additional feedback was gathered in a closing interview. Analysis of the annotations shows that the agreement among annotators ranges from low to mediocre. The non-expert annotators perceive the task as very challenging and report different problems in understanding the language and the context. Although fewer problems occur for the expert annotator, we cannot find any differences in the agreement levels among non-experts and between the expert and the non-experts. At the end of the paper, we discuss the implications of this study and future research plans for this area.</p>
      </abstract>
      <kwd-group>
        <kwd>sentiment analysis</kwd>
        <kwd>sentiment annotation</kwd>
        <kwd>drama</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The analysis of emotions, affects, moods, feelings and
sentiments in literary texts and their effect on the reader has
a long hermeneutical tradition in literary studies
        <xref ref-type="bibr" rid="ref11 ref12 ref29">(Winko,
2003; Meyer-Sickendiek, 2005; Mellmann, 2015)</xref>
        . Lately,
this area of study has been enhanced by computational
sentiment analysis techniques, which are used to
automatically predict sentiments and emotions in written
texts
        <xref ref-type="bibr" rid="ref1 ref13 ref16 ref2 ref21 ref27 ref5 ref7 ref8">(cf. Alm et al., 2005; Volkova et al., 2010; Jannidis et
al., 2016; Kakkonen &amp; Kakkonen, 2011; Kao &amp; Jurafsky,
2012; Mohammad, 2011; Nalisnick &amp; Baird, 2013;
Schmidt et al., 2018)</xref>
        . Sentiment analysis has become one
of the most active areas of research in computational
linguistics in recent years
        <xref ref-type="bibr" rid="ref26">(Vinodhini &amp; Chandrasekaran,
2012)</xref>
        and is typically used for the analysis of online
reviews and social media
        <xref ref-type="bibr" rid="ref10">(Liu, 2016)</xref>
        . However, a major
problem for the application of sentiment analysis methods
for literary texts is the lack of human-annotated training
data. Such data is an important prerequisite for the
evaluation of dictionary-based approaches (lists of words
annotated with sentiment information), which are among
the most popular methods for the sentiment analysis of
literary texts
        <xref ref-type="bibr" rid="ref13 ref16 ref21">(Mohammad, 2011; Nalisnick &amp; Baird, 2013;
Schmidt et al., 2018)</xref>
        . Manually curated training data is
even more important for supervised machine learning
approaches, which have proven to be very successful
in other areas of sentiment analysis
        <xref ref-type="bibr" rid="ref17">(Pang et
al., 2002)</xref>
        .
      </p>
      <p>
        Not only is there a lack of available training data; we
currently also lack research concerning difficulties and
problems in the transfer of standard methods for sentiment
annotation (mostly used in online reviews and social
media) to the field of narrative texts. For the area of fairy
tales,
        <xref ref-type="bibr" rid="ref1 ref2">Alm and Sproat (2005)</xref>
        conducted annotation studies
and reported several problems, such as low agreement
among annotators, strong imbalances concerning the
distribution of sentiments and misinterpretations of the
sentiment annotation scheme. Another question that arises
is the level of expertise necessary to correctly annotate
sentiment: In the context of historical political texts,
        <xref ref-type="bibr" rid="ref23">Sprugnoli et al. (2016)</xref>
        have found strong differences in
annotations among experts, among participants of a
crowdsourcing project and between the experts and the
crowd. Furthermore, the special needs in sentiment
annotation as well as requirements concerning the analysis
of sentiments and emotions in narrative and poetic texts
have yet to be explored.
        <xref ref-type="bibr" rid="ref23">Sprugnoli et al. (2016)</xref>
        were able
to identify special interests of professional historians for
sentiment analysis and annotation (e.g. the sentiment of
specific topics rather than text units) by including them in
the annotation process.
      </p>
      <p>As a prerequisite for a large-scale automatic annotation
project, we are currently exploring sentiment annotation for
historic (18th century) plays by G. E. Lessing, to examine
the aforementioned questions and challenges concerning
sentiment annotation of German literary texts. In this
article, we present preliminary annotation results of our
first experiments with five non-expert annotators and one
expert annotator with a corpus of 200 speeches. In addition,
we also used a questionnaire and conducted interviews with
the annotators to gather more insights concerning the
annotation behavior as well as problems with the
annotation scheme and the overall process. With regard to
our overall project, we want to derive specific requirements
for sentiment annotation in literary studies and examine
which level of expertise is necessary for this specific
context. Thus, we want to aid the development of
annotation schemes and annotation tools for this area and
further support the planning of future annotation studies.
</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>The corpus of the overall project consists of twelve plays
and altogether 8,224 speeches with an average length of 24
words per speech. For our annotation study, we randomly
selected a sample of 200 speeches. Five non-expert
annotators (four female and one male) participated in the
study. They were all fluent in German but otherwise not
experts on the plays of Lessing. One expert
annotator (female) with a PhD in German literary studies
and with research experience especially about Lessing also
participated in the study. With this sample, we were able to
gather a total of 1,200 annotations.</p>
      <p>
        Since very short speeches may not contain any sentiment
bearing words at all and generally pose challenges for the
annotators due to a lack of context, we only selected
speeches with a minimum length of 19 words, which is
about 25% below the average speech length. In the final
annotation corpus, speeches had an average length of 50
words. Furthermore, we selected the speeches to reflect the
distribution of speeches for different plays in our corpus,
i.e. plays with overall more speeches are also represented
with more speeches in our test corpus. We excluded
speeches from our test corpus when we assumed language
issues for the annotators, for instance speeches containing
French or Latin words, which may be problematic for the
German speaking annotators. Note that 200 speeches
represent approx. 2% of our entire corpus. Although this
might be considered a rather small sample size, this is not
uncommon for the domain of historical and poetic texts
        <xref ref-type="bibr" rid="ref1 ref2 ref23">(cf.
Alm &amp; Sproat, 2005; Sprugnoli et al., 2016)</xref>
        , as annotations of
this type are typically a laborious task.
      </p>
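      <p>The sampling procedure described above (a minimum-length filter plus proportional representation of the plays) can be sketched as follows. This is only an illustration under our own assumptions: the field names "play" and "text", the seed, and the per-play quota rounding are not taken from the study.</p>

```python
import random

def sample_speeches(speeches, k, min_words=19, seed=0):
    """Draw a sample that reflects each play's share of speeches.

    `speeches` is a list of dicts with hypothetical keys "play" and
    "text"; speeches shorter than `min_words` are filtered out first.
    Per-play quotas are rounded, so the sample size may deviate
    slightly from `k` for unfavourable proportions.
    """
    rng = random.Random(seed)
    pool = [s for s in speeches if len(s["text"].split()) >= min_words]
    by_play = {}
    for s in pool:
        by_play.setdefault(s["play"], []).append(s)
    sample = []
    for items in by_play.values():
        # a play contributes speeches in proportion to its share of the pool
        quota = round(k * len(items) / len(pool))
        sample.extend(rng.sample(items, min(quota, len(items))))
    return sample
```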
      <p>
        The annotators were asked to use a multi-part annotation
scheme based on various existing schemes for sentiment
analysis. Most related studies use a categorical annotation
scheme, differentiating only between positive, negative,
neutral/objective, mixed and unknown
        <xref ref-type="bibr" rid="ref19 ref20 ref4">(Bosco et al., 2014; Refaee
&amp; Rieser, 2014; Saif et al., 2013)</xref>
        . Other studies refer to
ordinal or continuous ratings, ranging from positive to
negative
        <xref ref-type="bibr" rid="ref14 ref24">(Takala et al., 2014; Momtazi, 2012)</xref>
        .
        <xref ref-type="bibr" rid="ref28">Wiebe et al.
(2005)</xref>
        developed a more complex scheme consisting of
polarity categories and intensities for these categories.
However, related work shows that oftentimes initially more
sophisticated schemes are later simplified to a binary
variant (positive/negative), since more complicated
schemes cause lower agreement between human annotators
        <xref ref-type="bibr" rid="ref14 ref24">(Momtazi, 2012; Takala et al., 2014)</xref>
        . This reduction can
also be observed in literary studies:
        <xref ref-type="bibr" rid="ref1 ref2">Alm and Sproat (2005)</xref>
        at first used a complex annotation scheme with different
emotional categories but then reduced it to a binary polarity
of “emotion present” and “emotion not present”
        <xref ref-type="bibr" rid="ref1 ref2">(Alm et
al., 2005)</xref>
        .
        <xref ref-type="bibr" rid="ref23">Sprugnoli et al. (2016)</xref>
        chose a basic scheme of
positive, negative, neutral and unknown.
      </p>
      <p>In our study, we wanted to investigate whether this
observation also holds for historic German plays, asking
the annotators to use both a fairly simple and a more
complex annotation scheme. The annotators were
presented each of the 200 speeches together with the
preceding and the following speech, to provide the necessary
context for interpretation. First, annotators were asked to
assign one of six categories (very negative, negative,
neutral, mixed, positive and very positive) to each speech.
We will refer to this annotation as differentiated polarity
annotation. Next, they had to assign a binary annotation
(positive/negative). Finally, participants were able to annotate
the presence of one or more emotion categories from a set of
eight basic emotions (anger, fear, surprise, trust,
anticipation, joy, disgust, sadness). Figure 1 illustrates the
annotation process. For the differentiated polarity and the
binary polarity, every annotator was asked to choose the most
adequate sentiment category. For the emotion categories, the
instruction was to mark any emotions that are present in a
speech. Every annotator was personally introduced to the
annotation process, which was also explained with practical
examples. At the end of the overall annotation task,
participants were asked to complete a questionnaire about
different facets of the annotation process. In the first part of
the questionnaire, participants rated their overall impression
of the annotation tasks on a 7-point Likert scale (do not agree
at all/fully agree):</p>
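      <p>The multi-part scheme described above can be captured in a small data structure. The following is only a sketch of one possible representation; the class and field names and the validate helper are our own assumptions, not a format used in the study.</p>

```python
from dataclasses import dataclass, field

# The three annotation layers described in the text.
POLARITY_6 = ("very negative", "negative", "neutral",
              "mixed", "positive", "very positive")
EMOTIONS = ("anger", "fear", "surprise", "trust",
            "anticipation", "joy", "disgust", "sadness")

@dataclass
class SpeechAnnotation:
    differentiated: str                          # one of POLARITY_6
    binary: str                                  # "positive" or "negative"
    emotions: set = field(default_factory=set)   # any subset of EMOTIONS

    def validate(self):
        # raise AssertionError if any layer is outside the scheme
        assert self.differentiated in POLARITY_6
        assert self.binary in ("positive", "negative")
        assert self.emotions <= set(EMOTIONS)
```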
      <list list-type="bullet">
        <list-item><p>The annotation of the speeches was difficult. (overall-difficulty)</p></list-item>
        <list-item><p>The annotation of the speeches concerning the polarity was difficult. (polarity-difficulty)</p></list-item>
        <list-item><p>The annotation of the speeches concerning the emotion categories was difficult. (emotion-difficulty)</p></list-item>
        <list-item><p>I was very confident with my assignments. (overall-certainty)</p></list-item>
        <list-item><p>I was very confident with my assignments concerning the polarities of the speeches. (polarity-certainty)</p></list-item>
        <list-item><p>I was very confident with my assignments concerning the emotion categories of the speeches. (emotion-certainty)</p></list-item>
      </list>
      <p>In addition, participants were asked to report how much time they needed to perform the annotation of all 200 speeches. Annotators were also asked to report the most important problems and difficulties in a free response field. Finally, we conducted a short closing interview with all participants after the complete annotation task, discussing their overall experience with the annotation process.</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>As differences and similarities between the annotations of
non-experts and the domain expert are of special interest
for us with regard to the design of future annotation studies,
we will examine these data sets separately. Firstly, we
report the results concerning distributions for the
differentiated polarity annotation among non-experts in
Table 1 (in total 1000 annotations).</p>
      <p>We observed that for the vast majority of annotations our
participants chose negative annotations. They represent
almost 50% of all annotations, while the share of positive
and very positive annotations is significantly lower (16%).
The results also show that the groups “mixed” and
“neutral” are relevant and important annotation groups,
since they appear almost as often or even more frequently
than positive annotations. As for the binary polarity, we
found that 665 annotations (67%) were negative and 335
(33%) positive.</p>
      <p>The results show that the distribution of expert annotations
is overall quite similar to the annotations of the
non-experts, since the majority of annotations are negative
(62%). However, one major difference is that the expert
annotator rarely used the annotation mixed (3%), while
non-experts used it for 23% of all annotations. For the binary
polarity, the distribution of the expert is identical to that of the
non-experts: 134 negative (67%) and 66 positive (33%)
annotations. Due to the length constraints of this extended
abstract we will not present the results of emotion
annotation in detail. However, some major findings are that
the most frequent emotion annotations are anticipation
(30.6%) and anger (21.1%) while disgust is chosen very
rarely (3.9%). We also examined if a speech is annotated
with at least one emotion. This is the case for the vast
majority of speeches (79.50%).</p>
      <p>
        We performed different statistical tests to analyze the
influence of the length of a speech. An analysis of variances
with the polarity groups (negative, positive, mixed, neutral)
and the length of the speeches shows that there is a
significant effect of length on the chosen polarity
annotation for non-experts, F(3, 997)=4.40, p=0.004.
Speeches annotated as mixed tend to be longer (M=56.35,
SD=45.70) than other speeches and especially than neutral
annotated speeches, which are on average the shortest type
of speeches (M=41.67; SD=28.94). For the expert
annotations, no significant differences among the same
polarity groups could be found. However, descriptive
analysis also shows that negative (M=54.52, SD=47.31)
and mixed (M=55.17, SD=55.17) speeches are
considerably longer than neutral speeches (M=32,
SD=12.98). We found no significant influence of
length on the binary polarity. We also
examined statistics concerning the level of agreement (see
Table 3). As measures, we chose Krippendorff’s α
        <xref ref-type="bibr" rid="ref9">(Krippendorff, 2011)</xref>
        and the average percentage of
agreement of all annotator pairs (APA).
      </p>
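      <p>Both measures can be computed directly from the matrix of labels. The following minimal sketch is our own implementation, assuming complete annotations (no missing values) and at least two label categories; it is not the code used in the study.</p>

```python
from collections import Counter
from itertools import combinations, permutations

def average_pairwise_agreement(annotations):
    """APA: mean share of items on which an annotator pair agrees.
    `annotations` is a list of equal-length label sequences."""
    scores = [sum(x == y for x, y in zip(a, b)) / len(a)
              for a, b in combinations(annotations, 2)]
    return sum(scores) / len(scores)

def krippendorff_alpha_nominal(annotations):
    """Krippendorff's alpha for nominal labels without missing values."""
    # coincidence counts over ordered annotator pairs, per item
    coincidence = Counter()
    for i in range(len(annotations[0])):
        labels = [ann[i] for ann in annotations]
        for x, y in permutations(range(len(labels)), 2):
            coincidence[(labels[x], labels[y])] += 1 / (len(labels) - 1)
    n = sum(coincidence.values())
    marginals = Counter()
    for (c, _), v in coincidence.items():
        marginals[c] += v
    # observed vs. expected disagreement (nominal distance: 0/1)
    d_obs = sum(v for (c, k), v in coincidence.items() if c != k) / n
    d_exp = sum(marginals[c] * marginals[k]
                for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp
```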
      <table-wrap id="tbl3">
        <label>Table 3</label>
        <caption><p>Measures of agreement for polarity annotations among non-experts.</p></caption>
        <table>
          <thead>
            <tr><th/><th>Krippendorff’s α</th><th>APA</th></tr>
          </thead>
          <tbody>
            <tr><td>Differentiated polarity</td><td>0.22</td><td>40%</td></tr>
            <tr><td>Binary polarity</td><td>0.47</td><td>77%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Krippendorff’s α and the APA for the differentiated polarity are very low. However, the level of agreement increases to a moderate level for the binary polarity. We could not find a significant influence of speech length on the level of agreement. To analyze the difference between the expert and the non-expert annotators, we calculated the agreement of every non-expert with the expert separately via Cohen’s kappa (κ) and the APA and then averaged these values (see Table 4).</p>
      <table-wrap id="tbl4">
        <label>Table 4</label>
        <caption><p>Averaged measures of agreement for polarity annotations of non-experts with the expert.</p></caption>
        <table>
          <thead>
            <tr><th/><th>Averaged κ values</th><th>Averaged APA</th></tr>
          </thead>
          <tbody>
            <tr><td>Differentiated polarity</td><td>0.19</td><td>39%</td></tr>
            <tr><td>Binary polarity</td><td>0.45</td><td>76%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Krippendorff’s α and Cohen’s κ are related agreement metrics, so a comparison between them is statistically sound. Similar to the agreement solely among non-experts, the level of agreement is very low for the differentiated polarity and moderate for the binary polarity. To further analyze differences between the expert and the non-experts for the binary polarity, we determined for each speech the annotation value chosen by the majority of non-experts and compared it to the annotation of the expert. In 43 cases (21%) the expert annotation differed from the majority annotation of the non-experts. The numbers are similar when comparing non-experts among each other. With regard to the emotion annotation, the calculation of Krippendorff’s α and κ is skewed, since the distribution of emotions always shows an excessive proportion of “not present” for all single emotion categories. As a consequence, the APA values are rather high, ranging from 61% for anticipation to 95% for disgust.</p>
      <p>Because of the higher agreement, we chose the binary polarity as the final determinant for the annotation of polarity. We assigned each of the 200 speeches the consensus of the majority of all annotators (n=6); whenever there was no majority, the expert annotation was used as a tie-breaker. That is, a speech is assigned a category if at least four annotators agree on it, and in case of a tie, the annotation of the expert is chosen (this was the case 19 times). As a result, 138 speeches were assigned as negative and 62 as positive.<sup>1</sup> Table 5 summarizes the results of the questionnaire statements concerning the difficulty of the annotation as well as the confidence about the annotation decisions among the non-experts and the expert. A median value of 6 shows that the annotation was perceived as very challenging by the non-experts. With regard to confidence, the median points to mediocre certainty for polarities and rather low certainty for emotion annotations. The expert, however, reported that she perceived the task as between easy and moderately challenging. Yet her level of certainty is only slightly higher than the mean values of the non-experts. On average, participants needed around 5 hours to complete the entire annotation; the expert reported the same amount of time. Analyzing the answers in the free response field as well as the post-annotation interviews, the following major difficulties among non-experts were reported:</p>
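      <p>The majority decision with the expert tie-breaker described above can be expressed compactly. A minimal sketch under the assumption of six binary labels per speech; the function name and label strings are our own illustrative choices.</p>

```python
def gold_label(labels, expert_label):
    """Binary gold label for one speech from six annotations
    (five non-experts plus the expert). A category is assigned if at
    least four of the six annotators agree; a 3-3 tie falls back to
    the expert's annotation."""
    positive = sum(1 for label in labels if label == "positive")
    if positive >= 4:
        return "positive"
    if len(labels) - positive >= 4:
        return "negative"
    return expert_label  # 3-3 tie
```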
      <list list-type="bullet">
        <list-item><p>Poetic and archaic language, e.g. unknown words and complex sentences</p></list-item>
        <list-item><p>Problems in putting a speech into a content-related overall context</p></list-item>
        <list-item><p>Interpretation of irony and sarcasm</p></list-item>
        <list-item><p>Multiple emotions and polarity shifts during a speech, especially in longer speeches</p></list-item>
        <list-item><p>Some speeches seem to be meaningless because they are too short or consist of irrelevant phrases</p></list-item>
        <list-item><p>The annotation process is perceived as cognitively very challenging; breaks to refocus concentration are needed</p></list-item>
        <list-item><p>Difficulties in understanding the content and context of a speech sometimes lead to almost randomly selecting an annotation</p></list-item>
        <list-item><p>It is not always clear what should be annotated: the sentiment of the language, the sentiment towards a person, the sentiment towards a subject, or the emotional state of the speaker?</p></list-item>
      </list>
      <p>The feedback of the expert included most of the aforementioned points. However, she did not report as many difficulties with the language and the context, but stated that she often was unsure whether she should annotate the sentiment based on the word level or based on the overall context of the text.</p>
      <p>1 The corpus with all annotations is available online as a structured table: https://docs.google.com/spreadsheets/d/1f72hS2WDRBOrxzSY_tsM_igChG2bvxYTyMVZP6kOnuk/edit?usp=sharing</p>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
The overrepresentation of negatively connoted speeches is
very pronounced and consistent with findings from
        <xref ref-type="bibr" rid="ref1 ref2">Alm
and Sproat (2005)</xref>
        in the context of fairy tales. This is a
remarkable result, since our specific corpus consists mostly
of comedies, which intuitively should be in a rather positive
tone (as opposed to tragedies). This overrepresentation is
also consistent among the expert and non-experts. We are
currently working together with literary scholars to further
explore and interpret this phenomenon. The
overrepresentation is also an important finding for further
annotation studies and sentiment analysis projects, as it
suggests an annotation scheme that differentiates between
negative sentiments or that uses continuous scales.
Another finding concerning the distribution of sentiments is
that overall we have fewer neutral annotations than in
related studies on narrative and historical texts
        <xref ref-type="bibr" rid="ref1 ref2 ref23">(Alm &amp;
Sproat, 2005; Sprugnoli et al., 2016)</xref>
        . We also found that
annotators perceive the presence of at least one emotion for
most of the speeches, underlining that dramas are
particularly suited and interesting for sentiment analysis. In
addition, the class of “mixed” speeches makes for a
substantial part of the corpus, at least for the non-experts.
According to the annotations and the statements of the
non-experts, the main reason for this is overly long speeches,
which oftentimes contain significant changes of sentiment.
Although the expert annotator did not choose the mixed
annotation very often, the problem of polarity shifts and
multiple emotions was also reported. Overall, we conclude
that future annotation schemes should be able to handle
such changes within a speech for a more precise annotation.
As for annotator agreement, we found low to mediocre
levels of agreement. This observation is consistent with
similar research in the field of narrative and historical texts
        <xref ref-type="bibr" rid="ref1 ref2 ref23">(Alm &amp; Sproat, 2005; Alm et al., 2005; Sprugnoli et al.,
2016)</xref>
        , although these studies regard the sentiment of
sentences and not of drama speeches. However, similar
annotation studies with other text types achieve much higher
levels of agreement in terms of kappa statistics,
which are comparable to Krippendorff’s α. The annotator
agreements range from 0.8 to 1.0 for text types like movie
reviews
        <xref ref-type="bibr" rid="ref25">(Thet et al., 2010)</xref>
        , social media comments
        <xref ref-type="bibr" rid="ref18">(Prabowo &amp; Thelwall, 2009)</xref>
        , sentences from websites
        <xref ref-type="bibr" rid="ref6">(Kaji &amp; Kitsuregawa, 2007)</xref>
        and microblogs
        <xref ref-type="bibr" rid="ref3">(Bermingham
&amp; Smeaton, 2010)</xref>
        . For our annotation scenario, it is
noticeable that the agreement among non-experts and the
agreement between non-experts and the expert annotator
are both similarly low to mediocre, i.e. that based on the
current data, the difference between an expert and a
non-expert is very similar to the difference between two or more
non-experts. Overall, the results confirm our assumption
that sentiment annotation of narrative texts is more
problematic than in other fields. It seems to be a rather
subjective annotation task that does not primarily depend
on domain expertise. The low agreement is also important
for future evaluations of sentiment analysis methods since
the level of agreement is often used as performance
baseline (
        <xref ref-type="bibr" rid="ref15">Mozetič et al., 2016</xref>
        ). We will have to investigate
how annotation agreement can be generally improved, e.g.
by more specific introductions to the task and some
common guidelines that give hints on how to use the
annotation scheme and how to deal with problems such as
uncertainty.
      </p>
      <p>
        The results of the questionnaire and the concluding
interview support these claims. Non-expert participants
perceive the annotation as very difficult and report only
mediocre certainty about the correctness of their
annotation. They state that the task is cognitively very
challenging and that it demands high levels of
concentration throughout the process. Non-experts also had
issues with the historic language and the context of some
speeches. In contrast, the expert did not perceive the
annotation task as too difficult or demanding, and only
reported minor issues with language and missing context.
The low agreements of more complex schemes would
suggest the usage of a rather simple scheme (e.g. binary
polarity). However, the results of the interviews also show
that the annotation schemes derived from application areas
like product reviews might not be suitable for the use case
of literary text. For example, annotators did not know how
to mark irony, sarcasm or multiple polarity shifts. The
annotators also noted that there are often multiple possible
targets for the annotation of a sentiment and that it is not
always clear which sentiment to choose. Based on this
feedback, we suggest extending the scheme so that
annotators can specify the target of the sentiment,
e.g. another speaker, a topic or speaker that is directly or
indirectly talked about, etc.
        <xref ref-type="bibr" rid="ref22">(cf. Shin et al., 2012)</xref>
        . As for
another challenge, some annotators also mentioned that
they were sometimes inclined to interpret sentiment from a
rather subjective perspective, as they had personal
associations with some of the speeches. Future research
should pay attention to these problems and instructions and
annotation schemes should be as clear and precise as
possible to avoid confusion.
      </p>
      <p>
One of our main goals was to explore whether non-experts are
potentially capable of performing sentiment annotation for
historic plays, because non-experts are obviously more
available than experts, which is an important aspect for the
design of future large-scale studies. We found that
non-experts perceived the task as more challenging and that more
serious problems occurred, e.g. not understanding the
language or the context correctly. While the usage of
non-experts in the annotation process is not uncommon for
sentiment analysis
        <xref ref-type="bibr" rid="ref27">(Volkova et al., 2010)</xref>
        , we found that
they seemed to struggle in our particular annotation
scenario. On the other hand, the agreement between
non-experts and the expert is no different from that among
non-experts only, which indicates that experts also struggle with
the task. This observation is reflected by related studies of
        <xref ref-type="bibr" rid="ref1 ref2">Alm and Sproat (2005)</xref>
        as well as
        <xref ref-type="bibr" rid="ref23">Sprugnoli et al. (2016)</xref>
        ,
who also report low levels of agreement while using trained
students or even more advanced experts. If
expertise were an important factor, the agreement between a
non-expert and the expert should be notably lower. The
distribution of polarities is also very similar between
non-experts and the expert. Further, taking the majority decision
of all non-expert annotators leads to annotations that are
very similar to the expert’s annotations. With regard to the
time needed to complete the annotation task, there
are also no major differences, as it took around
5 hours to finish the annotation for both the expert and the
non-experts.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Directions</title>
      <p>We believe it is feasible to use non-experts in a large-scale
crowdsourcing context for the annotation of historic plays,
keeping in mind the improvements regarding the
annotation scheme and instructions mentioned before. As
for the language problems, non-expert annotators could be
provided with a lexicon of the most frequent historic words.
This lexicon could also be used to filter speeches that
contain problematic language, which could then be
reserved for an expert annotator. The issue of missing
context could be easily resolved by providing a digital tool
for the annotation task (which would be needed for a
large-scale study in any case), allowing for the
optional display of arbitrary portions of context.
As our annotation study was solely focused on speeches,
more complex structural levels such as scenes, acts,
speakers or speaker relations could also be taken into
account for future studies. The interpretation of these levels
would also be necessary to get a more complete view on a
drama. Another problem is that feedback from both the
expert and the non-experts points to the lack of precise
instructions for the sentiment annotation task, which
certainly is a factor contributing to the low agreement.
Furthermore, we are aware that our sample size with only
one expert is very small, so further research will be
necessary to explore which level of expertise is tolerable
and if there are also significant differences in annotation
behavior between experts and non-experts in a larger study.
We are currently conducting a follow-up annotation study
with trained students of German literary studies to analyze
if the problems described in this article persist, if
differences in the annotation process occur and if their level
of expertise is sufficient. For this study, we will adjust our
annotation scheme and use a bigger corpus and more
participants.</p>
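      <p>The proposed lexicon-based routing of linguistically difficult speeches to an expert could look roughly as follows. The function name, the token normalization and the threshold of 10% unknown tokens are purely illustrative assumptions.</p>

```python
def needs_expert(speech, lexicon, max_unknown_share=0.1):
    """Flag a speech for expert annotation when too many of its tokens
    (e.g. archaic, French or Latin words) are missing from a lexicon
    of the most frequent historic German words."""
    tokens = [t.strip('.,;:!?"').lower() for t in speech.split()]
    tokens = [t for t in tokens if t]
    unknown = sum(1 for t in tokens if t not in lexicon)
    return unknown / max(len(tokens), 1) > max_unknown_share
```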
      <p>We also want to further examine how literary scholars
annotate sentiment and which requirements an annotation
scheme for this context has to meet. As a long-term goal,
we would like to develop an annotation scheme optimized
for the context of drama sentiment annotation. On this basis,
we hope to be able to develop tools for more efficient sentiment
annotation and to acquire large-scale annotated corpora for
evaluation and machine learning purposes.</p>
    </sec>
    <sec id="sec-6">
      <title>Bibliographical References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Alm</surname>
            ,
            <given-names>C. O.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Emotional sequencing and development in fairy tales</article-title>
          .
          <source>In International Conference on Affective Computing and Intelligent Interaction</source>
          (pp.
          <fpage>668</fpage>
          -
          <lpage>674</lpage>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Alm</surname>
            ,
            <given-names>C. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sproat</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Emotions from text: machine learning for text-based emotion prediction</article-title>
          .
          <source>In Proceedings of the conference on human language technology and empirical methods in natural language processing</source>
          (pp.
          <fpage>579</fpage>
          -
          <lpage>586</lpage>
          ). Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Bermingham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A. F.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Classifying sentiment in microblogs: is brevity an advantage?</article-title>
          <source>In Proceedings of the 19th ACM international conference on Information and knowledge management</source>
          (pp.
          <fpage>1833</fpage>
          -
          <lpage>1836</lpage>
          ). ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bosco</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Allisio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mussa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruffo</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanguinetti</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sulis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Detecting happiness in Italian tweets: Towards an evaluation dataset for sentiment analysis in Felicitta</article-title>
          .
          <source>In Proc. of the 5th International Workshop on Emotion, Social Signals, Sentiment and Linked Open Data</source>
          ,
          <source>ESSSLOD</source>
          (pp.
          <fpage>56</fpage>
          -
          <lpage>63</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Jannidis</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reger</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zehe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hettinger</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Analyzing Features for the Detection of Happy Endings in German Novels</article-title>
          .
          <source>arXiv preprint arXiv:1611.09028</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Kaji</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kitsuregawa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>Building lexicon for sentiment analysis from massive collection of HTML documents</article-title>
          .
          <source>In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Kakkonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Kakkonen</surname>
            ,
            <given-names>G. G.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>SentiProfiler: creating comparable visual profiles of sentimental content in texts</article-title>
          .
          <source>In Proceedings of Language Technologies for Digital Humanities and Cultural Heritage</source>
          (pp.
          <fpage>62</fpage>
          -
          <lpage>69</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A computational analysis of style, affect, and imagery in contemporary poetry</article-title>
          .
          <source>In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature</source>
          (pp.
          <fpage>8</fpage>
          -
          <lpage>17</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Krippendorff</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Computing Krippendorff's Alpha-Reliability</article-title>
          . Retrieved from http://repository.upenn.edu/asc_papers/43
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Sentiment Analysis</article-title>
          .
          <source>Mining Opinions, Sentiments and Emotions</source>
          . New York: Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Mellmann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Literaturwissenschaftliche Emotionsforschung</article-title>
          . In: Rüdiger Zymner (Ed.): Handbuch Literarische Rhetorik. Berlin/Boston,
          <fpage>173</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Meyer-Sickendiek</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Affektpoetik: eine Kulturgeschichte literarischer Emotionen</article-title>
          .
          <source>Würzburg: Königshausen &amp; Neumann.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>From once upon a time to happily ever after: Tracking emotions in novels and fairy tales</article-title>
          .
          <source>In Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage</source>
          ,
          <source>Social Sciences, and Humanities</source>
          (pp.
          <fpage>105</fpage>
          -
          <lpage>114</lpage>
          ). Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Momtazi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Fine-grained German Sentiment Analysis on Social Media</article-title>
          . In LREC (pp.
          <fpage>1215</fpage>
          -
          <lpage>1220</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Mozetič</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grčar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Smailović</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Multilingual Twitter sentiment classification: The role of human annotators</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>11</volume>
          (
          <issue>5</issue>
          ), e0155036.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Nalisnick</surname>
            ,
            <given-names>E. T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Baird</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Character-to-character sentiment analysis in Shakespeare's plays</article-title>
          .
          <source>In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics</source>
          (pp.
          <fpage>479</fpage>
          -
          <lpage>483</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Vaithyanathan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Thumbs up?: sentiment classification using machine learning techniques</article-title>
          .
          <source>In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10</source>
          (pp.
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          ). Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Prabowo</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Thelwall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Sentiment analysis: A combined approach</article-title>
          .
          <source>Journal of Informetrics</source>
          ,
          <volume>3</volume>
          (
          <issue>2</issue>
          ),
          <fpage>143</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Refaee</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rieser</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>An Arabic Twitter Corpus for Subjectivity and Sentiment Analysis</article-title>
          .
          <source>In LREC</source>
          (pp.
          <fpage>2268</fpage>
          -
          <lpage>2273</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Saif</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Alani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Evaluation datasets for Twitter sentiment analysis: a survey and a new dataset, the STS-Gold</article-title>
          .
          <source>In 1st International Workshop on Emotion and Sentiment in Social and Expressive Media: Approaches and Perspectives from AI (ESSEM 2013)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Schmidt</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burghardt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Dennerlein</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>"Kann man denn auch nicht lachend sehr ernsthaft sein?" - Zum Einsatz von Sentiment Analyse-Verfahren für die quantitative Untersuchung von Lessings Dramen</article-title>
          .
          <source>In Book of Abstracts, DHd 2018</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cattle</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Annotation Scheme for Constructing Sentiment Corpus in Korean</article-title>
          . In PACLIC (pp.
          <fpage>181</fpage>
          -
          <lpage>190</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Sprugnoli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tonelli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marchetti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Moretti</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Towards sentiment analysis for historical texts</article-title>
          .
          <source>Digital Scholarship in the Humanities</source>
          ,
          <volume>31</volume>
          (
          <issue>4</issue>
          ),
          <fpage>762</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Takala</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malo</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ahlgren</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Gold-standard for Topic-specific Sentiment Analysis of Economic Texts</article-title>
          .
          <source>In LREC</source>
          (Vol. 2014, pp.
          <fpage>2152</fpage>
          -
          <lpage>2157</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Thet</surname>
            ,
            <given-names>T. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Na</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Khoo</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Aspect-based sentiment analysis of movie reviews on discussion boards</article-title>
          .
          <source>Journal of information science</source>
          ,
          <volume>36</volume>
          (
          <issue>6</issue>
          ),
          <fpage>823</fpage>
          -
          <lpage>848</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Vinodhini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Chandrasekaran</surname>
            ,
            <given-names>R. M.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Sentiment analysis and opinion mining: a survey</article-title>
          .
          <source>International Journal of Advanced Research in Computer Science and Software Engineering</source>
          ,
          <volume>2</volume>
          (
          <issue>6</issue>
          ),
          <fpage>282</fpage>
          -
          <lpage>292</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Volkova</surname>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohler</surname>
            ,
            <given-names>B. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meurers</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerdemann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Bülthoff</surname>
            ,
            <given-names>H. H.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Emotional perception of fairy tales: achieving agreement in emotion annotation of text</article-title>
          .
          <source>In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text</source>
          (pp.
          <fpage>98</fpage>
          -
          <lpage>106</lpage>
          ). Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Wiebe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cardie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Annotating expressions of opinions and emotions in language</article-title>
          .
          <source>Language resources and evaluation</source>
          ,
          <volume>39</volume>
          (
          <issue>2</issue>
          ),
          <fpage>165</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>Winko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2003</year>
          ).
          <article-title>Über Regeln emotionaler Bedeutung in und von literarischen Texten</article-title>
          . In: Fotis Jannidis, Gerhard Lauer, Matias Martinez &amp; Simone Winko (eds.):
          <source>Regeln der Bedeutung. Zur Theorie der Bedeutung literarischer Texte</source>
          . Berlin, New York: de Gruyter,
          <fpage>329</fpage>
          -
          <lpage>348</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>