The reader's feeling and text-based emotions: The relationship between subjective self-reports, lexical ratings, and sentiment analysis

Egon Werlen (1), Fernando Benites (2), Christof Imhof (1), Per Bergamin (1)
egon.werlen@ffhs.ch, benf@zhaw.ch, christof.imhof@ffhs.ch, per.bergamin@ffhs.ch
(1) Swiss Distance University of Applied Sciences (FFHS)
(2) Zurich University of Applied Sciences (ZHAW)

Abstract

In this study, we examined how precisely a sentiment analysis and a word-list-based lexical analysis predict the emotional valence (i.e. positive or negative emotional states) of 63 emotional short stories. Both the sentiment analysis and the word-list-based analysis predicted subjective valence, which was predicted even more precisely when both analysis methods were combined. These results can, for example, contribute to the development of new technology-based teaching designs, in that positive or negative emotions in the texts or online contributions of students can be assessed in automated form and translated into instructional measures. Such instructional actions can, for example, be hints, learning support, or feedback adapted to the students' emotional state.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

There has been great progress in technology-based learning in recent decades. Methods and procedures of learning analytics have recently played an important role here. In principle, learning analytics is about collecting data from students during learning and using it to improve teaching. Despite progress in Natural Language Processing (NLP), texts or contributions from students have rarely been used as a source of information for learning analytics or for technology-based learning (e.g. Shibani, 2017). In this article, we used a small corpus of texts with 900 to 1100 characters each, in the form of emotional short stories, to find out to what extent it is possible to automatically capture emotions as positive or negative emotional colouring of texts. The aim of this article is to assess how well two different methods of automatically capturing emotions in texts predicted the subjective assessment of emotional reactions to these texts, be it individually or in combination.

2 Theoretical background

In the late nineties, Barrett and Russell (1999) developed the circumplex model, a model of emotions with two dimensions: emotional valence and emotional arousal. Emotional valence is the experience of one's own actual positive or negative feeling. Emotional arousal is the subjective amount of internal activation or energy. Together, these two dimensions form the core affect, "the most elementary consciously accessible affective feelings that need not be directed at anything" (p. 806). The circumplex model provided the theoretical basis for the present work.

Emotional valence, based on the circumplex model, was measured on a bipolar scale ranging from very negative to very positive. This method was originally conceived by Wundt (1896) and is the most commonly used method to date. However, like the sentiment analysis used in this study, some theories view valence as a bivariate construct (e.g. Norris et al., 2010; Briesemeister et al., 2012; Shuman et al., 2013; Kron et al., 2015). According to those views, humans can perceive objects (e.g. images, words, texts) as positive and negative at the same time, enabling them to have an ambiguous quality. This highlights that measuring emotions is a challenging and debated task (see also e.g. Mauss and Robinson, 2009).

2.1 Subjective measurement by self-reporting

Today, research assumes that individual measurements cannot capture the phenomenon of emotions entirely. This leads to the practice of using multiple measuring methods in scientific investigations, often in conjunction. Self-reports such as questionnaires or single-item questions are a popular way of measuring emotions, and for good reasons: they have good validity (as long as response biases are taken into account) and enable quick and simple data collection. Non-verbal alternatives to measure emotions can also be used, such as the Self-Assessment Manikins (SAM scale) from Bradley and Lang (1994), which measure feelings, i.e. the subjective experience of emotions. This instrument contains visual rather than verbal stimuli (i.e. pictures rather than questions), which consist of abstract representations of a human being displaying different emotions. The scale varies in three dimensions: valence, arousal, and dominance. The valence dimension shows pictures ranging from a smiling face to a frowning face, with more neutral expressions in between; in the arousal dimension, pictures range from a sleepy and calm figure to a wide-eyed, excited expression. We did not use the dominance dimension, which represents the controlling and dominant nature of emotion, shown by a tiny figure in the middle of a square for low dominance up to an oversized figure going beyond the borders of the square for high dominance. Raters were instructed to choose the image that best represents their own current emotional state.

2.2 Objective measurement by lexical ratings and sentiment analysis

Despite their popularity, self-reports are far from the only instrument used in affective science. Lexical analysis (i.e. analysis based on single words) is a different, more objective instrument which historically has been used significantly less often. In this regard, Jacobs et al. (2015) argue, based on long-existing works by Freud (1891) and Bühler (1934), that spoken or written words contain the potential to elicit both overt and covert sensu-motoric or affective reactions. In this context we speak of embodied stimuli. Recent neurological research supports this relationship, as demonstrated in Jacobs (2015). On this basis, it can be explained that words can evoke both basic and fictional emotions as well as something like aesthetic feelings.

Before neurological research pointed out these connections, there was a clear language-emotion gap, i.e. most emotion theories ignored language functions, while linguistic theories ignored affective processes. In order to bridge that gap, the Berlin Affective Word List (BAWL-R) was developed (Vo et al., 2009). The BAWL-R is a large German word list containing almost 3000 words (nouns, verbs, and adjectives) from the CELEX database (Baayen et al., 1993), each rated on valence, arousal, and imageability. The list also includes psycholinguistic factors (e.g. number of letters, phonemes, word frequency, accent). It is free for download (1); to open the file, a password must be requested. The BAWL-R enables estimations of the emotional potential of single words, but also extrapolations for sentences and whole texts.

1 cf. https://www.ewi-psy.fu-berlin.de/einrichtungen/arbeitsbereiche/allgpsy/Download/BAWL/index.html accessed May 2019

The BAWL-R specifically has been utilized for this purpose as well: Aryani et al. (2015) analysed poems, Lehne et al. (2015) examined E.T.A. Hoffmann's black-romantic story "The Sandman", Hsu et al. (2015) analysed passages of Harry Potter novels, and Jacobs and Kinder (2017) investigated potentially relevant properties of Shakespeare's sonnets. These studies found that affective word ratings correlated with whole-text ratings and came to the conclusion that a text's constituting words can predict its emotional potential. Studies using the BAWL-R to predict the subjective valence of short texts (Hsu et al., 2015) and poems (Ullrich et al., 2017) with lexical valence found correlations of r = .58 (short texts) and r = .65 (poems).

Since about the year 2000, research on sentiment, and as a consequence the term sentiment analysis, has appeared in the computational science literature with increasing frequency (e.g. Nasukawa and Yi, 2003; Das and Chen, 2001). Liu (2012) describes sentiment analysis as a part of natural language processing (NLP) that extracts people's emotions, sentiments, opinions etc. from spoken or written language. It focuses mainly on positive and negative sentiments. Sentiment analysis is a learning-based approach that, in contrast to lexical analysis, does not necessarily rely on rated word lists and instead implements machine learning. Technically speaking, word-based lexical analysis could be categorized as a semantic approach to sentiment analysis that does not necessarily implement machine learning.
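The word-list-based scoring described above (assign each text the mean rated valence of the words it contains) can be sketched as follows. The mini-lexicon and its ratings are invented for illustration; the real BAWL-R contains almost 3000 rated German words and must be obtained separately:

```python
# Word-list-based ("lexical") valence scoring: each word found in the rated
# list contributes its valence rating (-3 very negative .. +3 very positive);
# the text score is the mean over all matched words.
# TOY_LEXICON is a made-up stand-in for the BAWL-R.

TOY_LEXICON = {
    "liebe": 2.6,   # love
    "musik": 1.9,   # music
    "tod": -2.8,    # death
    "angst": -2.4,  # fear
}

def lexical_valence(text, lexicon):
    """Mean valence of all words in `text` covered by `lexicon`;
    returns None if no word is covered."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else None

print(lexical_valence("Die Musik erinnerte sie an ihre erste Liebe.", TOY_LEXICON))  # 2.25
```

Words not present in the list simply do not contribute to the score, which is also how texts were scored in the study.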
Sentiment analysis, also called opinion mining or polarity detection, as explained by Fueyo (2018), "refers to the set of AI algorithms and techniques used to extract the polarity of a given document: whether the document is positive, negative or neutral", which is represented as classes or a probability. Angiani et al. (2016) list the possible steps of a sentiment analysis: 1) initialization step (data collection, data processing, attribute selection), 2) learning step (algorithm, training model), and 3) evaluation step (test set).

The automatic sentiment analysis system used for this paper is composed of two parts, namely the model and the data. The multi-layered convolutional network model is the same as in Deriu et al. (2017). The authors trained this network, as shown in Figure 1, with a large number of weakly supervised tweets in different languages, and demonstrated the importance of pre-training such networks. The specific pre-training procedure, named distant-supervised learning, is trained on larger weakly or non-labelled samples (2). Afterwards, the network is further trained on a much smaller data set with manually, strongly labelled samples. The approach was evaluated on various multi-lingual data sets, including the SemEval-2016 sentiment prediction benchmark (Task 4), where it achieved state-of-the-art performance.

2 On Twitter, emoticons/emojis can be used as weak labels; for instance, a tweet with a smiling emoji will probably have a positive sentiment.

This model was trained on the SB10k German Twitter sentiment corpus (Cieliebak et al., 2017), which is a corpus for sentiment analysis with approximately 10,000 German tweets. Tweets are normally a sentence long and are often connoted with emotions. Although the domain is not the same, the focus on sentences and on emotions is very similar in the data sets used (train and test). The word embeddings used were weakly trained on 40 million German tweets. Here, emoticons were used to automatically label the emotional content of a tweet (positive, negative, neutral). Finally, the output of the network is the confidence (from 0 to 1) for each one of the three sentiments.

Both lexical and sentiment analyses have been applied to different types of texts to measure their emotional potential in different contexts. Mossholder et al. (1995) analysed emotions in open-ended survey responses by applying the Dictionary of Affect in Language (DAL); Loughran and McDonald (2015) used the Diction software in order to analyse and categorize the tone of business documents such as financial reports; Humphreys and Wang (2017) implemented automated text analysis for examining text patterns in consumer research; Lima et al. (2015) analysed Twitter messages within a polarity analysis framework; Whissell (2011) analysed Poe's poetry; and Whissell (1996) used the "emotion clock" to conduct a stylometric analysis of Beatles songs, to name a few examples.

2.3 Combination of different measurement procedures

A comparison of different procedures for emotion recognition on the sentence level was conducted by Aman (2007). He concluded that a combination of different automatic procedures for recording emotions is advantageous. This finding is also supported in a paper by Strapparava and Mihalcea (2010). They tested several methods for automatically detecting emotions in short texts (headlines and blog posts; 100-400 characters). Six annotators rated the presence of six distinct emotions as well as the valence of the texts, which were then predicted by several procedures. The study found that "different methods have different strengths, especially with respect to individual emotions" (p. 35). Most interestingly, the correlation between emotions evaluated by human raters and those found by algorithms was moderate, with at most r = .48 (explained variance at most 24%). The largest effect was found for valence analysed with a knowledge-based, domain-independent, unsupervised CLaC approach.

We assume, as already mentioned above, that different measurements can cover certain aspects of the complex phenomenon of emotions that others do not. Different measurements often only reveal parts of a phenomenon and might sometimes even be contradictory. Thus, the combination of several measurement techniques can prove to be fruitful. In our study, we are interested in the combination of lexical analysis, sentiment analysis and self-report, and specifically in whether prediction of the latter improves when the former two are combined.

2.4 Hypotheses

As discussed above, it has been known for a long time (e.g. Freud, 1891; Bühler, 1934) that words can trigger emotional reactions, which more recently has been confirmed in neurological studies (Jacobs, 2015). According to the circumplex model of emotion by Barrett and Russell (1999), the emotional valence, i.e. the personal appraisal of whether and how strongly something is perceived positively or negatively, is one of the most basic emotional reactions. The emotional valence, i.e. the subjective valence, of the 63 short texts was assessed by university students rating their emotional responses to these texts (17-19 ratings per text). As explained above, the emotional valence of a text can also be measured objectively, in our case with sentiment analysis and lexical analysis. We were interested in finding out whether these automated objective measurement approaches could predict the subjective valence. If so, they could serve as an approximation rather than relying on repeated self-reports of subjective ratings. This leads to the first hypothesis. As several studies have shown (e.g. Aman, 2007; Strapparava and Mihalcea, 2010), combinations of several methods for estimating emotions in texts lead to better predictions than one method alone. This leads to the second hypothesis.

1. The emotional valence measured by lexical analysis and by sentiment analysis each predict the subjective valence of the short texts.

2. The combination of the measurement methods (lexical analysis and sentiment analysis) increases the predictive power.

Despite the sizable amount of research in emotion and in text analysis, we are not aware of many studies that not only compared (e.g. Nielsen, 2011; Hutto and Gilbert, 2014) but also combined both word-list-based lexical analysis and sentiment analysis to predict subjective ratings of emotional valence in short texts (e.g. Dhaoui et al., 2017).
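The emoticon-based weak labelling behind the distant-supervised pre-training described in Section 2.2 can be sketched as follows; the emoticon sets and the `weak_label` helper are illustrative assumptions, not the authors' code:

```python
# Distant supervision: emoticons act as noisy sentiment labels and are then
# removed from the text, so the model cannot simply memorize them.
POSITIVE = {":)", ":-)", ":D", "<3"}
NEGATIVE = {":(", ":-(", ":'("}

def weak_label(tweet):
    """Return (cleaned_text, label); label is 'positive', 'negative', or
    None when there is no (or contradictory) emoticon evidence."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither found, or both -> unusable as a label
        return tweet, None
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, "positive" if has_pos else "negative"

print(weak_label("so ein schöner Tag :)"))  # ('so ein schöner Tag', 'positive')
```

Tweets labelled this way form the large, weakly labelled pre-training set; a much smaller, manually labelled corpus (here SB10k) is then used for the supervised fine-tuning step.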
3 Methods

3.1 Samples and measurements

The 63 analysed texts originated from a collection of 102 German texts written by 32 authors, 21 of whom were German-speaking students and staff of a university in Germany (mean age 26.10, SD = 10.65; gender: 85% women), and 11 of whom were recruited by the first author (university staff and people recruited via social media and personal contacts; mean age 36.82, SD = 15.78; 64% women). These 63 texts are part of an international database with over 200 emotional short stories which are developed and refined within the framework of the COST initiative E-Read IS 1404 (Kaakinen et al., in preparation). The international database contains stories from Finland, France, Germany, Portugal, Spain, Switzerland, and Turkey. All stories are subjectively rated on emotional valence, emotional arousal and comprehensibility in their original language and in English. All texts have a length of 900 to 1100 characters including spaces. Texts that were not written in the first person were rewritten without changing their content and structure. The topic varies from story to story; some of them tell of joyful events and experiences (e.g. birth, love, music), others of negative ones (e.g. death, abuse). A few stories are emotionally neutral, i.e. neither positive nor negative and with a medium level of emotional arousal. The stories are mostly easy to understand. Once finished, the database will be presented in a publication and made freely accessible.

The subjective valence rating of the texts was conducted with the Self-Assessment-Manikin scale (SAM (3)) by Lang (1980). We used a modified 9-point scale by Suk (2006). Participants were instructed to rate the texts by choosing one of nine icons to represent their current emotional state. The 63 texts were rated on the survey platform Qualtrics by 55 native German-speaking university students from different majors of a German university. The raters' mean age was 23.47 years (SD = 2.62); 90.9% were female. Each participant rated a randomly predetermined set of 21 texts in randomized order, so that each text was evaluated by one of three groups of 17-19 participants each. As compensation, participants had the chance to win one of fifteen 10 € Amazon vouchers. The inter-rater reliability for the subjective rating of emotional valence, calculated with the R package irr by Gamer et al. (2012), was .98 or higher in each of the three groups.

3 cf. http://irtel.uni-mannheim.de/pxlab/demos/index_SAM.html accessed Feb. 2019

The semantic lexical analysis of the texts was conducted with the revised form of the Berlin Affective Word List, BAWL-R (Vo et al., 2009), in R (R Core Team, 2017), using the packages tidyverse (Wickham, 2017) and sylly (Michalke, 2018). In that list, valence had been rated on a 7-point Likert scale (-3 very negative through 0 neutral to +3 very positive). For each short story, we averaged the valence of all the words in that text that are represented in the BAWL-R.

The automatic sentiment analysis was trained on sentences. Nevertheless, we applied it to our short stories as a whole, since the subjective ratings we wanted to predict were on a text rather than a sentence level. For this paper, we calculated a new overall valence variable for the sentiment analysis data based on the negative and positive scores (negative sentiment minus positive sentiment), assuming that the neutral sentiment had no influence on the positive or negative orientation of the analysis. The reason for this decision was that the three original variables sum up to 1 and are therefore interdependent. Consequently, their individual effects on the subjective ratings canceled each other out. This new variable was created in order to obtain values comparable to the BAWL-R valence variable.

We further analyzed the texts in terms of readability. Readability was scored with the well-established Flesch Index (Flesch, 1948), using a formula adapted to the German language (Amstad, 1978). Means and standard deviations of all valence measures used are reported in Table 1.

Valence               mean    SD     min    max    scale
Subjective            4.46    2.21   1.17   8.61   1 - 9
Lexical               0.63    0.27   0.02   1.17   -3 - 3
Sentiment            -0.44    0.26  -0.95   0.20   -1 - 1
Sentiment positive    0.14    0.10   0.01   0.47   0 - 1
Sentiment negative    0.29    0.12   0.04   0.84   0 - 1

Table 1: Mean, standard deviation (SD), minimum (min), maximum (max), and possible values (scale) of valence measured subjectively (rated by students), with lexical analysis (BAWL-R), and with sentiment analysis
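Collapsing the three interdependent class confidences into the single overall valence variable described above can be sketched as follows; the example confidences are invented:

```python
# The sentiment model outputs one confidence per class, summing to 1.
# Following the difference rule described in the text, the overall valence
# variable is the difference of the negative and positive confidences;
# the neutral class is assumed not to shift the polarity.

def overall_valence(positive, negative, neutral):
    assert abs(positive + negative + neutral - 1.0) < 1e-6, "confidences must sum to 1"
    return negative - positive

# invented example confidences for one story
print(overall_valence(positive=0.14, negative=0.29, neutral=0.57))
```

Because the three confidences are constrained to sum to 1, only this single difference score (rather than the three raw scores) enters the regression models as a predictor.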
Figure 1: Convolutional Neural Network Model from Deriu et al. (2017)

3.2 Analyses

We chose to conduct our regression analyses with a Bayesian approach, which has important advantages over traditional frequentist null hypothesis significance testing. Within the Bayesian approach, the interpretation of data is not affected by sampling intention. In contrast to the frequentist approach, the Bayesian approach permits assessment of the relative credibility of parameter values given the data and the statistical model (Kruschke, 2010). The statistical analyses were conducted with R version 3.3.4 (R Core Team, 2017) and the R package brms version 2.4.0 (Bürkner, 2018), a package for Bayesian generalized multivariate non-linear multilevel models. To allow comparisons with other studies that correlated lexical or sentiment analysis with subjective ratings of texts, we calculated correlations of the standardized values averaged over the 63 texts as beta values with brms. For the multilevel models predicting subjective valence, the raw data of all 55 raters were included in the model with rater as a level 2 predictor. The resulting sample included 1143 observations, i.e. 63 texts with an average of 18 raters. The predictors (sentiment, lexical valence, and Flesch Index) were averaged for each of the 63 texts. The subjective valence ratings, an ordinal scaled variable with values ranging from 1 to 9, were modelled with a cumulative distribution. The Bayesian credible interval, meaning the range a certain value lies within with a probability of 95% (thus not to be confused with the frequentist confidence interval), is reported for all results. Since this is the first study in this context applying Bayesian analysis, no informative priors were available. We thus decided to use brms' default priors. The Leave-One-Out Cross-Validation information criterion (LOOic) was used to compare the different models. The LOOic is a method "for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values" (p. 1413; Vehtari et al., 2017).

4 Results

The correlation between the sentiment value, calculated as the difference between negative and positive sentiment, and the lexical value of valence was r = .50 (95% credible interval, CrI = [.28; .72]). Both of them had a moderate positive correlation with the subjective valence ratings of the texts (r = .51, 95% CrI = [.28; .72], for sentiment; r = .62, 95% CrI = [.42; .82], for lexical valence). There was a weak correlation between the Flesch readability score and the other three variables (sentiment: r = -.24, 95% CrI = [-.48; .01]; lexical valence: r = -.12, 95% CrI = [-.37; .12]; subjective valence: r = -.24, 95% CrI = [-.50; .01]).

A visual inspection of the MCMC chains and the R-hat diagnostic, with all R-hat values < 1.02, revealed good convergence for all estimated parameters of all calculated models.

The restricted model (Model 0), including only the intercepts and the level 2 variable, had a LOOic of 4969. Model 1, predicting subjective valence by sentiment, had an effect of β = 4.60 (95% CrI = [2.62; 6.56]); the LOOic was 4717. Model 2, predicting subjective valence by lexical valence, had an effect of β = 5.37 (95% CrI = [3.62; 7.01]) with a LOOic of 4578. To decide which model to prefer, we relied on the credible intervals of the LOOic. The credible intervals of the LOOic of Models 1 and 2 did not overlap with that of the restricted Model 0 (see Table 2). This led to our conclusion that both models predicted subjective valence and that the first hypothesis could be confirmed.

Model   LOOic   se   CrI 5%   CrI 95%
M0      4969    17   4935     5002
M1      4717    36   4646     4787
M2      4578    39   4502     4654
M3      4515    42   4433     4597
M3+     4494    42   4412     4577

Table 2: LOO information criteria with standard error and credible intervals (CrI)

Model 3, predicting the subjective valence of the texts by sentiment and lexical valence (BAWL-R), is presented in Table 3. The design formula for Model 3 was formulated as follows:

R_i ~ Ordered(p)                                  [likelihood]
logit(p_k) = α_k - φ_i                            [cumulative link and linear model]
φ_i = β_BAWL * BAWL_i + β_Sent * Sent_i           [linear model]
α_k ~ Normal(0, 10)                               [common prior for each intercept]
β_BAWL ~ Normal(0, 10)                            [β_BAWL prior]
β_Sent ~ Normal(0, 10)                            [β_Sent prior]

R_i follows the ordered distribution (i.e. a categorical distribution that takes the vector p = {p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8} of probabilities of each subjective valence rating value below the maximum category of 9). α_k is the unique intercept of each possible outcome value k, φ_i is the linear model that is subtracted from each intercept, β_BAWL and β_Sent are the slopes of the BAWL-R (lexical analysis) and sentiment values respectively, and BAWL_i and Sent_i are the values of the two predictor variables on row i.
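The cumulative-link ("ordered logit") likelihood in the design formula can be sketched numerically as follows; the cutpoint and predictor values are invented for illustration:

```python
# Ordered-logit likelihood: K - 1 sorted cutpoints alpha_k and a linear
# predictor phi define P(R = k) for the K = 9 ordinal rating categories.
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_probs(alphas, phi):
    """Category probabilities P(R = 1) .. P(R = K) from cumulative
    probabilities P(R <= k) = inv_logit(alpha_k - phi)."""
    cum = [inv_logit(a - phi) for a in alphas] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# 9 categories -> 8 cutpoints (invented values);
# phi = beta_BAWL * BAWL_i + beta_Sent * Sent_i for one text
alphas = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
phi = 3.42 * 0.2 + 2.07 * 0.1
probs = ordered_probs(alphas, phi)
print(round(sum(probs), 6))  # a proper distribution over the 9 ratings
```

A positive φ shifts probability mass toward higher rating categories, which is how positive lexical and sentiment values translate into higher predicted subjective valence.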
Parameter          R̂      n_eff   β      CrI 5%   CrI 95%
b_raterIntercept   1.01    1242    0.11   0.01     0.28
b_sentiment        1.00    4000    2.07   1.58     2.59
b_valence          1.00    4000    3.42   2.96     3.89

Table 3: Results of the Bayesian linear regression analysis

The subjective valence was predicted by sentiment with β = 2.07 (95% CrI = [1.58; 2.59]) and by lexical valence with β = 3.42 (95% CrI = [2.96; 3.89]). The LOO information criterion of Model 3 (LOOic = 4515) was smaller than that of either of the other models. The credible interval of Model 1, but not of Model 2, does not overlap with the credible interval of the combined Model 3. We conclude that sentiment and lexical analysis together predict subjective valence better than sentiment analysis alone. Even though the credible interval of Model 3 overlaps with that of Model 2, we consider the difference of their LOOic big enough to conclude that the prediction of subjective valence is better when sentiment and lexical analysis are combined than with either one of them on its own. This confirmed the second hypothesis. The level 2 predictor (text raters) in Model 3 had a small but negligible effect on subjective valence.

The integration of readability (Flesch Index) did not improve the model. The credible intervals were mostly overlapping and the information gain was negligible, with a small effect of readability on subjective valence (β = -0.04; 95% CrI = [-0.09; -0.02]; LOOic = 4494). The LOOic, standard errors, and credible intervals of the five models are listed in Table 2.

Figure 2 visualizes the prediction of subjective valence by sentiment and lexical valence given the data and the statistical model, considering the effects of both predictors (Model 3). The figure shows the slope (blue line) with its gray-shadowed 95% credible interval. Both predictors have a tight credible interval that does not include 0, indicating clear positive effects for both predictors.

Figure 2: Prediction of subjective valence by sentiment and lexical valence

5 Discussion

The aim of this study was to compare different techniques to capture emotions in 63 short texts. Our investigations focused on the question whether the prediction of readers' subjectively appraised emotions towards texts improves when word-list-based lexical analysis and sentiment analysis are both considered. The results confirmed our hypotheses that lexical and sentiment analyses both predict subjective valence independently or in combination (hypothesis 1). The strongest effect resulted when both approaches were combined (hypothesis 2). This confirms Dhaoui et al. (2017) and corresponds to the findings of Aman (2007) and Strapparava and Mihalcea (2010) that combinations of algorithms result in better predictions.

Other studies predicting subjective valence ratings of texts with the same word list (BAWL-R) found correlations of r = .58 (Hsu et al., 2015; for short passages of the Harry Potter novels) and r = .65 (Ullrich et al., 2017; for poems of Enzensberger), which are in the same range as our results (r = .62). Studies using other word lists (e.g. LIWC), for instance Settanni and Marengo (2015), found lower correlations between negative emotions expressed in Facebook posts and the corresponding subjective negative emotions (r = .22; for younger people r = .40). The same holds for the sentiment analysis, where our result (r = .50) corresponds to results published in the literature. Correlations of different algorithms with subjective valence ratings were reported, for instance, by Strapparava and Mihalcea (2010) for detecting sentiment in headlines. The algorithm with the best predictive power was the CLaC system, which "relies on a knowledge-based domain-independent unsupervised approach to headline valence detection and scoring. The system uses three main kinds of knowledge: a list of sentiment-bearing words, a list of valence shifters and a set of rules that define the scope and the result of the combination of sentiment-bearing words and valence shifters" (p. 28). This algorithm found a correlation of r = .48 for valence. The correlations of the other four algorithms were all below r = .40. In comparison to these studies, the sentiment analysis used in this study revealed a rather high correlation. A more recent study (Preoţiuc-Pietro et al., 2016) found a higher correlation of r = .65 between sentiment analysis and subjective valence with a bag-of-words linear regression model.

When both measurement techniques were combined in one model, the effect of lexical valence predicting subjective valence was stronger than the effect of the sentiment analysis. The β = 2.07 in Model 3 for sentiment means that an increase of 1 SD in the sentiment values corresponds to an increase of 2.07 SDs in the predicted subjective valence. Likewise, the β = 3.42 for lexical valence means that an increase of 1 SD in the lexical valence values corresponds to an increase of 3.42 SDs in the predicted subjective valence. This stronger effect of lexical analysis was also visible in the correlations of each variable with subjective valence. Both predictors correlate with each other (r = .50) and therefore share a good part of their variance. This explains the overlap between the credible intervals of Model 3 (sentiment and lexical analysis as predictors) and Model 2 (lexical analysis as predictor). Nevertheless, we considered the information gain of Model 3 over Model 2 to be large enough and therefore favour Model 3.

It is known from the literature that the difficulty of texts has an impact on the emotions when reading (e.g. Yin et al., 2014; Ben-David et al., 2016). To take this into account we investigated an additional model, Model 3+. One way to determine text difficulty is the Flesch Index (Flesch, 1948; Amstad, 1978). When this predictor was taken into account in the model, only a very small effect could be found. Therefore, we decided not to pursue this additional variant any further. One reason for the small effect of the Flesch Index might be that, when selecting the texts at the beginning of the study, we made sure that none of the texts used had extreme Flesch values, in order to avoid biases of the measurement results due to comprehension problems.
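The German adaptation of the Flesch reading-ease score used for this readability predictor can be sketched as follows; the constants are Amstad's (1978) variant as commonly reported, and the example counts are invented:

```python
# Flesch reading ease, German adaptation (Amstad, 1978):
#   score = 180 - ASL - 58.5 * ASW
# where ASL = average sentence length in words and ASW = average number of
# syllables per word. Higher scores mean easier texts. Syllable counts are
# passed in directly here; in the study they came from hyphenation tooling
# (the sylly package in R).

def flesch_amstad(n_words, n_sentences, n_syllables):
    asl = n_words / n_sentences    # average sentence length
    asw = n_syllables / n_words    # average syllables per word
    return 180.0 - asl - 58.5 * asw

# invented example: 100 words, 8 sentences, 160 syllables
print(round(flesch_amstad(100, 8, 160), 1))  # 73.9
```

Longer sentences and longer words both lower the score, which is why texts with extreme values were excluded from the corpus in the first place.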
An explanation for the weaker performance of the sentiment analysis compared to the lexical rating may be that the 63 analysed texts were part of an international database of emotional short texts (Kaakinen et al., in preparation). In this context, the emotional content of the entire text (not the word or sentence level) was assessed by student raters. This differs from the method of sentiment analysis, which was applied to each short story but was trained on Twitter messages, i.e. on the sentence level (Cieliebak et al., 2017). As in other studies, the lexical analysis is based on the average valence of words that were previously evaluated by students (Vo et al., 2009). We assigned each word of the short texts that was also included in the Berlin Affective Word List its valence value and averaged these values, obtaining a mean value for each short text. We assume that, due to the different aspects of valence measured by the procedures mentioned, the combination model, and therefore the combination of the different aspects of valence measured, achieved the best prediction values. However, in order to actually confirm this assumption we need to further investigate whether the correlations between the three measurements and the fit of the different models remain at a lower taxonomic level, i.e.
Another question is whether a combination of measurement techniques that were developed and validated in a context other than short stories, such as the sentiment analysis trained on tweets, is appropriate, or whether it is better to use techniques developed in the same context. We are under the impression that the sentiment analysis applied in this study did a rather good job compared to other procedures.

There are some aspects of our approach that we did not account for in this study and that may be worth exploring in future work. One such aspect is the perceived difficulty of the rating task, since the subjective ratings may be biased if the task is thought to be either particularly easy or particularly difficult. This concerns both the text ratings and the ratings underlying the two analysis approaches, which rely on subjective expert ratings at their very core. Another aspect worthy of inspection is the discrepancy between the human ratings and the two analysis approaches, since they do not necessarily align at all times. Exploring under which circumstances they diverge may prove to be a promising venture.

6 Conclusions

The results indicate that lexical and sentiment analyses predict the subjective appraisal of emotions triggered by short texts. The two methods are not redundant; it is therefore worthwhile to analyse the emotional potential of texts with both measurement procedures. A next step is to repeat these analyses at the sentence and word level to check whether we obtain even stronger predictive power. We also need to examine the integration of other text properties, including other semantic parameters, into our analysis, as done by Jacobs and Kinder (2017). The small effect gain of the Flesch-Index can be interpreted as an indication that non-emotional text properties could play a role in the perception of emotions in a text.

These results can, for example, contribute to the development of new instructional designs that measure the emotional appraisals of students engaged in digital learning tasks. Positive or negative emotions in students' texts or online contributions can be assessed automatically and transferred into instructional measures, thus helping to integrate automated learning support into feedback, hints or adaptive instructional design. For a useful integration of such sensors, i.e. measurements of emotional or affective properties of texts in digital learning, into educational practice, we need even more predictive power. From our point of view, this can be achieved by combining different measurement methods.

Acknowledgments

A big 'Thank you' to Yvonne Kammerer of the Leibniz-Institut für Wissensmedien in Tübingen, who organized the collection of the texts from Germany and the rating of the 63 texts gathered as part of a project in the COST Action E-READ. We also thank Mark Cieliebak and Jan Deriu for providing the sentiment prediction system and for helpful discussions, and Stéphanie McGarrity for proofreading and useful suggestions.

References

Saima Aman. 2007. Recognizing emotions in text. Ph.D. thesis, University of Ottawa, Canada.

T. Amstad. 1978. Wie verständlich sind unsere Zeitungen? [How understandable are our newspapers?]. Unpublished doctoral dissertation, University of Zürich, Switzerland.

Giulio Angiani, Laura Ferrari, Tomaso Fontanini, Paolo Fornacciari, Eleonora Iotti, Federico Magliani, and Stefano Manicardi. 2016. A comparison between preprocessing techniques for sentiment analysis in Twitter. In KDWeb.

A. Aryani, M. Kraxenberger, S. Ullrich, A. M. Jacobs, and M. Conrad. 2015. Measuring the basic affective tone in poetry using phonological iconicity and subsyllabic salience. Psychol. Aesthet. Creat. Arts.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, Philadelphia, PA: University of Pennsylvania.

Lisa Feldman Barrett and James A. Russell. 1999. The structure of current affect: Controversies and emerging consensus. Current Directions in Psychological Science 8(1):10–14.

Boaz M. Ben-David, Maroof I. Moral, Aravind K. Namasivayam, Hadas Erel, and Pascal H. H. M. van Lieshout. 2016. Linguistic and emotional-valence characteristics of reading passages for clinical use and research. Journal of Fluency Disorders 49:1–12.

Margaret M. Bradley and Peter J. Lang. 1994. Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry 25(1):49–59.

Benny B. Briesemeister, Lars Kuchinke, and Arthur M. Jacobs. 2012. Emotional valence: A bipolar continuum or two independent dimensions? SAGE Open 2(4):2158244012466558.

Karl Bühler. 1934. Sprachtheorie [Language theory]. Stuttgart: G. Fischer.

Paul-Christian Bürkner. 2018. Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10(1):395–411.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the 5th International Workshop on Natural Language Processing for Social Media, Boston, MA, USA. Association for Computational Linguistics, pages 45–51.

Sanjiv Das and M. Chan. 2001. Extracting market sentiment from stock message boards. Asia Pacific Finance Association 2001.

Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. 2017. Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In WWW 2017 – International World Wide Web Conference, Perth, Australia.

Chedia Dhaoui, Cynthia M. Webster, and Lay Peng Tan. 2017. Social media sentiment analysis: Lexicon versus machine learning. Journal of Consumer Marketing 34(6):480–488.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology 32(3):221.

Sigmund Freud. 1891. Zur Auffassung der Aphasien: Eine kritische Studie [On the interpretation of the aphasias: A critical study]. F. Deuticke.

Enrique Fueyo. 2018. Understanding what is behind sentiment analysis (part I). https://building.lang.ai/understanding-what-is-behind-sentiment-analysis-part-i-eaf1bcb43d2d. Last accessed 6 March 2019.

Matthias Gamer, Jim Lemon, A. Robinson, and W. Kendall. 2012. Package 'irr': Various coefficients of interrater reliability and agreement.

Chun-Ting Hsu, Arthur M. Jacobs, Francesca M. M. Citron, and Markus Conrad. 2015. The emotion potential of words and passages in reading Harry Potter – an fMRI study. Brain and Language 142:96–114.

Ashlee Humphreys and Rebecca Jen-Hui Wang. 2017. Automated text analysis for consumer research. Journal of Consumer Research 44(6):1274–1306.

Clayton J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Arthur M. Jacobs. 2015. Neurocognitive poetics: Methods and models for investigating the neuronal and cognitive-affective bases of literature reception. Frontiers in Human Neuroscience 9:186.

Arthur M. Jacobs and Annette Kinder. 2017. "The brain is the prisoner of thought": A machine-learning assisted quantitative narrative analysis of literary metaphors for use in neurocognitive poetics. Metaphor and Symbol 32(3):139–160.

Arthur M. Jacobs, Melissa L.-H. Võ, Benny B. Briesemeister, Markus Conrad, Markus J. Hofmann, Lars Kuchinke, Jana Lüdtke, and Mario Braun. 2015. 10 years of BAWLing into affective and aesthetic processes in reading: What are the echoes? Frontiers in Psychology 6:714.

Johanna K. Kaakinen, Egon Werlen, Yvonne Kammerer, S. Ruiz-Fernandez, Cengiz Acartürk, Xavier Aparicio, Thierry Baccino, Ugo Ballenghein, Per Bergamin, Nuria Castells Gomez, Armanda Costa, Isabel Falé, Olga Megalakaki, and M. Minguela. In preparation. Emotional text database [working title]. Behavior Research Methods.

Assaf Kron, Maryna Pilkiw, Jasmin Banaei, Ariel Goldstein, and Adam Keith Anderson. 2015. Are valence and arousal separable in emotional experience? Emotion 15(1):35.

John K. Kruschke. 2010. What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences 14(7):293–300.

P. J. Lang. 1980. Self-assessment manikin. Gainesville, FL: The Center for Research in Psychophysiology, University of Florida.

Moritz Lehne, Philipp Engel, Martin Rohrmeier, Winfried Menninghaus, Arthur M. Jacobs, and Stefan Koelsch. 2015. Reading a suspenseful literary text activates brain areas related to social cognition and predictive inference. PLoS One 10(5):e0124550.

Ana Carolina E. S. Lima, Leandro Nunes de Castro, and Juan M. Corchado. 2015. A polarity analysis framework for Twitter messages. Applied Mathematics and Computation 270:756–767.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1):1–167.

Tim Loughran and Bill McDonald. 2015. The use of word lists in textual analysis. Journal of Behavioral Finance 16(1):1–11.

Iris B. Mauss and Michael D. Robinson. 2009. Measures of emotion: A review. Cognition and Emotion 23(2):209–237.

Meik Michalke. 2018. sylly: Hyphenation and syllable counting for text analysis (Version 0.1-5). https://reaktanz.de/?c=hacking&s=sylly.

Kevin W. Mossholder, Randall P. Settoon, Stanley G. Harris, and Achilles A. Armenakis. 1995. Measuring emotion in open-ended survey responses: An application of textual data analysis. Journal of Management 21(2):335–355.

Tetsuya Nasukawa and Jeonghee Yi. 2003. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture. ACM, pages 70–77.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.

Catherine J. Norris, Jackie Gollan, Gary G. Berntson, and John T. Cacioppo. 2010. The current status of research on the structure of evaluative space. Biological Psychology 84(3):422–436.

Daniel Preoţiuc-Pietro, H. Andrew Schwartz, Gregory Park, Johannes Eichstaedt, Margaret Kern, Lyle Ungar, and Elisabeth Shulman. 2016. Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 9–15.

R Core Team. 2017. R: A language and environment for statistical computing. http://www.r-project.org.

Michele Settanni and Davide Marengo. 2015. Sharing feelings online: Studying emotional well-being via automated text analysis of Facebook posts. Frontiers in Psychology 6:1045.

Antonette Shibani. 2017. Combining automated and peer feedback for effective learning design in writing practices. In ICCE 2017 – 25th International Conference on Computers in Education: Technology and Innovation: Computer-Based Educational Systems for the 21st Century, Doctoral Student Consortia Proceedings.

Vera Shuman, David Sander, and Klaus R. Scherer. 2013. Levels of valence. Frontiers in Psychology 4:261.

Carlo Strapparava and Rada Mihalcea. 2010. Annotating and identifying emotions in text. In Intelligent Information Access, Springer, pages 21–38.

Hyeon-Jeong Suk. 2006. Color and emotion – A study on the affective judgment across media and in relation to visual stimuli. Ph.D. thesis, Universität Mannheim.

Susann Ullrich, Arash Aryani, Maria Kraxenberger, Arthur M. Jacobs, and Markus Conrad. 2017. On the relation between the general affective meaning and the basic sublexical, lexical, and inter-lexical features of poetic texts – a case study using 57 poems of H. M. Enzensberger. Frontiers in Psychology 7:2073.

Aki Vehtari, Andrew Gelman, and Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5):1413–1432.

Melissa L.-H. Võ, Markus Conrad, Lars Kuchinke, Karolina Urton, Markus J. Hofmann, and Arthur M. Jacobs. 2009. The Berlin Affective Word List Reloaded (BAWL-R). Behavior Research Methods 41(2):534–538.

Cynthia Whissell. 1996. Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities 30(3):257–265.

Cynthia Whissell. 2011. To those who feel rather than to those who think: Sound and emotion in Poe's poetry. International Journal of English and Literature 2(6):149–156.

Hadley Wickham. 2017. The tidyverse. R package version 1.1.1.

Wilhelm Wundt. 1896. Grundriss der Psychologie [Outlines of psychology]. Engelmann.

Guopeng Yin, Qingyuan Zhang, and Yimeng Li. 2014. Effects of emotional valence and arousal on consumer perceptions of online review helpfulness. In Twentieth Americas Conference on Information Systems, Savannah, USA.