The reader's feeling and text-based emotions: The relationship between subjective self-reports, lexical ratings, and sentiment analysis

Egon Werlen (1), Fernando Benites (2), Christof Imhof (1), Per Bergamin (1)
egon.werlen@ffhs.ch, benf@zhaw.ch, christof.imhof@ffhs.ch, per.bergamin@ffhs.ch
(1) Swiss Distance University of Applied Sciences (FFHS)
(2) Zurich University of Applied Sciences (ZHAW)

Abstract

In this study, we examined how precisely a sentiment analysis and a word-list-based lexical analysis predict the emotional valence (i.e. positive or negative emotional states) of 63 emotional short stories. Both the sentiment analysis and the word-list-based analysis predicted subjective valence, which was predicted even more precisely when both analysis methods were combined. These results can, for example, contribute to the development of new technology-based teaching designs, in that positive or negative emotions in the texts or online contributions of students can be assessed in automated form and translated into instructional measures. Such instructional actions can, for example, be hints, learning support, or feedback adapted to the students' emotional state.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

There has been great progress in technology-based learning in recent decades. Methods and procedures of learning analytics have recently played an important role here. In principle, learning analytics is about collecting data from students during learning and using it to improve teaching. Despite progress in Natural Language Processing (NLP), texts or contributions from students have rarely been used as a source of information for learning analytics or for technology-based learning (e.g. Shibani, 2017). In this article, we used a small corpus of texts with 900 to 1100 characters each, in the form of emotional short stories, to find out to what extent it is possible to automatically capture emotions as positive or negative emotional colouring of texts. The aim of this article is to assess how well two different methods of automatically capturing emotions in texts predicted the subjective assessment of emotional reactions to these texts, be it individually or in combination.

2 Theoretical background

In the late nineties, Barrett and Russell (1999) developed the circumplex model, a model of emotions with two dimensions: emotional valence and emotional arousal. Emotional valence is the experience of one's own actual positive or negative feeling. Emotional arousal is the subjective amount of internal activation or energy. Together, these two dimensions form the core affect, "the most elementary consciously accessible affective feelings that need not be directed at anything" (p. 806). The circumplex model provided the theoretical basis for the present work.

Emotional valence, based on the circumplex model, was measured on a bipolar scale ranging from very negative to very positive. This method was originally conceived by Wundt (1896) and is the most commonly used method to date. However, like the sentiment analysis used in this study, some theories view valence as a bivariate construct (e.g. Norris et al., 2010; Briesemeister et al., 2012; Shuman et al., 2013; Kron et al., 2015). According to those views, humans can perceive objects (e.g. images, words, texts) as positive and negative at the same time, enabling them to have an ambiguous quality. This highlights that measuring emotions is a challenging and debated task (see also e.g. Mauss and Robinson, 2009).

2.1 Subjective measurement by self-reporting

Today, research assumes that individual measurements cannot capture the phenomenon of emotions entirely. This leads to the practice of using multiple measuring methods in scientific investigations, often in conjunction. Self-reports such as questionnaires or single-item questions are a popular way of measuring emotions, and for good reasons: they have good validity (as long as response biases are taken into account) and enable quick and simple data collection. Non-verbal alternatives to measure emotions can also be used, such as the Self-Assessment Manikins (SAM scale) from Bradley and Lang (1994), which measure feelings, i.e. the subjective experience of emotions. This instrument contains visual rather than verbal stimuli (i.e. pictures rather than questions), which consist of abstract representations of a human being displaying different emotions. The scale varies in three dimensions: valence, arousal, and dominance. The valence dimension shows pictures ranging from a smiling face to a frowning face, with more neutral expressions in between; in the arousal dimension, pictures range from a sleepy and calm figure to a wide-eyed, excited expression. We did not use the dominance dimension, which represents the controlling and dominant nature of emotion, shown by a tiny figure in the middle of a square for low dominance up to an oversized figure going beyond the borders of the square for high dominance. Raters were instructed to choose the image that best represents their own current emotional state.

2.2 Objective measurement by lexical ratings and sentiment analysis

Despite their popularity, self-reports are far from the only instrument used in affective science. Lexical analysis (i.e. analysis based on single words) is a different, more objective instrument which historically has been used significantly less often. In this regard, Jacobs et al. (2015) argue, based on long-existing works by Freud (1891) and Bühler (1934), that spoken or written words contain the potential to elicit both overt and covert sensu-motoric or affective reactions. In this context we speak of embodied stimuli. Recent neurological research supports this relationship, as demonstrated in Jacobs (2015). On this basis, it can be explained that words can evoke both basic and fictional emotions as well as something like aesthetic feelings.

Before neurological research pointed out these connections, there was a clear language-emotion gap, i.e. most emotion theories ignored language functions, while linguistic theories ignored affective processes. In order to bridge that gap, the Berlin Affective Word List (BAWL-R) was developed (Vo et al., 2009). The BAWL-R is a large German word list containing almost 3000 words (nouns, verbs, and adjectives) from the CELEX database (Baayen et al., 1993), each rated on valence, arousal, and imageability. The list also includes psycholinguistic factors (e.g. number of letters, phonemes, word frequency, accent). It is free for download (1); to open the file, a password must be requested. The BAWL-R enables estimations of the emotional potential of single words, but also extrapolations for sentences and whole texts.

1 cf. https://www.ewi-psy.fu-berlin.de/einrichtungen/arbeitsbereiche/allgpsy/Download/BAWL/index.html accessed May 2019

The BAWL-R specifically has been utilized for this purpose as well: Aryani et al. (2015) analysed poems, Lehne et al. (2015) examined E.T.A. Hoffmann's black-romantic story "The Sandman", Hsu et al. (2015) analysed passages of Harry Potter novels, and Jacobs and Kinder (2017) investigated potentially relevant properties of Shakespeare's sonnets. These studies found that affective word ratings correlated with whole-text ratings and came to the conclusion that a text's constituting words can predict its emotional potential. Studies using the BAWL-R to predict the subjective valence of short texts (Hsu et al., 2015) and poems (Ullrich et al., 2017) with lexical valence found correlations of r = .58 (short texts) and r = .65 (poems).

Since about the year 2000, research on sentiment, and as a consequence the term sentiment analysis, has appeared in the computational science literature with increasing frequency (e.g. Nasukawa and Yi, 2003; Das and Chen, 2001). Liu (2012) describes sentiment analysis as a part of natural language processing (NLP) that extracts people's emotions, sentiments, opinions etc. from spoken or written language. It focuses mainly on positive and negative sentiments. Sentiment analysis is a learning-based approach that, in contrast to lexical analysis, does not necessarily rely on rated word lists and instead implements machine learning. Technically speaking, word-based lexical analysis could be categorized as a semantic approach to sentiment analysis that does not necessarily implement machine learning.
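The word-list-based scoring described above (assign each text the mean rated valence of the words it contains) can be sketched as follows. The mini-lexicon and its ratings are invented for illustration; the real BAWL-R contains almost 3000 rated German words and must be obtained separately:

```python
# Word-list-based ("lexical") valence scoring: each word found in the rated
# list contributes its valence rating (-3 very negative .. +3 very positive);
# the text score is the mean over all matched words.
# TOY_LEXICON is a made-up stand-in for the BAWL-R.

TOY_LEXICON = {
    "liebe": 2.6,   # love
    "musik": 1.9,   # music
    "tod": -2.8,    # death
    "angst": -2.4,  # fear
}

def lexical_valence(text, lexicon):
    """Mean valence of all words in `text` covered by `lexicon`;
    returns None if no word is covered."""
    tokens = [t.strip(".,;:!?\"'").lower() for t in text.split()]
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else None

print(lexical_valence("Die Musik erinnerte sie an ihre erste Liebe.", TOY_LEXICON))  # 2.25
```

Words not present in the list simply do not contribute to the score, which is also how texts were scored in the study.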
Sentiment analysis, also called opinion mining or polarity detection, as explained by Fueyo (2018), "refers to the set of AI algorithms and techniques used to extract the polarity of a given document: whether the document is positive, negative or neutral", which is represented as classes or a probability. Angiani et al. (2016) list the possible steps of a sentiment analysis: 1) initialization step (data collection, data processing, attribute selection), 2) learning step (algorithm, training model), and 3) evaluation step (test set).

The automatic sentiment analysis system used for this paper is composed of two parts, namely the model and the data. The multi-layered convolutional network model is the same as in Deriu et al. (2017). The authors trained this network, as shown in Figure 1, with a large number of weakly supervised tweets in different languages, and demonstrated the importance of pre-training such networks. The specific pre-training procedure, named distant-supervised learning, is trained on larger weakly or non-labelled samples (2). Afterwards, the network is further trained on a much smaller data set with manually, strongly labelled samples. The approach was evaluated on various multi-lingual data sets, including the SemEval-2016 sentiment prediction benchmark (Task 4), where it achieved state-of-the-art performance.

2 On Twitter, emoticons/emojis can be used as weak labels; for instance, a tweet with a smiling emoji will probably have a positive sentiment.

This model was trained on the SB10k German Twitter sentiment corpus (Cieliebak et al., 2017), which is a corpus for sentiment analysis with approximately 10,000 German tweets. Tweets are normally a sentence long and are often connoted with emotions. Although the domain is not the same, the focus on sentences and on emotions is very similar in the data sets used (train and test). The word embeddings used were weakly trained on 40 million German tweets. Here, emoticons were used to automatically label the emotional content of a tweet (positive, negative, neutral). Finally, the output of the network is the confidence (from 0 to 1) for each one of the three sentiments.

Both lexical and sentiment analyses have been applied to different types of texts to measure their emotional potential in different contexts. Mossholder et al. (1995) analysed emotions in open-ended survey responses by applying the Dictionary of Affect in Language (DAL); Loughran and McDonald (2015) used the Diction software in order to analyse and categorize the tone of business documents such as financial reports; Humphreys and Wang (2017) implemented automated text analysis for examining text patterns in consumer research; Lima et al. (2015) analysed Twitter messages within a polarity analysis framework; Whissell (2011) analysed Poe's poetry; and Whissell (1996) used the "emotion clock" to conduct a stylometric analysis of Beatles songs, to name a few examples.

2.3 Combination of different measurement procedures

A comparison of different procedures for emotion recognition on the sentence level was conducted by Aman (2007). He concluded that a combination of different automatic procedures for recording emotions is advantageous. This finding is also supported in a paper by Strapparava and Mihalcea (2010). They tested several methods for automatically detecting emotions in short texts (headlines and blog posts; 100-400 characters). Six annotators rated the presence of six distinct emotions as well as the valence of the texts, which were then predicted by several procedures. The study found that "different methods have different strengths, especially with respect to individual emotions" (p. 35). Most interestingly, the correlation between emotions evaluated by human raters and those found by algorithms was moderate, with at most r = .48 (explained variance at most 24%). The largest effect was found for valence analysed with a knowledge-based, domain-independent, unsupervised CLaC approach.

We assume, as already mentioned above, that different measurements can cover certain aspects of the complex phenomenon of emotions that others do not. Different measurements often only reveal parts of a phenomenon and might sometimes even be contradictory. Thus, the combination of several measurement techniques can prove to be fruitful. In our study, we are interested in the combination of lexical analysis, sentiment analysis and self-report, and specifically in whether prediction of the latter improves when the former two are combined.

2.4 Hypotheses

As discussed above, it has been known for a long time (e.g. Freud, 1891; Bühler, 1934) that words can trigger emotional reactions, which more recently has been confirmed in neurological studies (Jacobs, 2015). According to the circumplex model of emotion by Barrett and Russell (1999), the emotional valence, i.e. the personal appraisal of whether and how strongly something is perceived positively or negatively, is one of the most basic emotional reactions. The emotional valence, i.e. the subjective valence, of the 63 short texts was assessed by university students rating their emotional responses to these texts (17-19 ratings per text). As explained above, the emotional valence of a text can also be measured objectively, in our case with sentiment analysis and lexical analysis. We were interested in finding out whether these automated objective measurement approaches could predict the subjective valence. If so, they could serve as an approximation rather than relying on repeated self-reports of subjective ratings. This leads to the first hypothesis. As several studies have shown (e.g. Aman, 2007; Strapparava and Mihalcea, 2010), combinations of several methods for estimating emotions in texts lead to better predictions than one method alone. This leads to the second hypothesis.

1. The emotional valence measured by lexical analysis and by sentiment analysis each predict the subjective valence of the short texts.

2. The combination of the measurement methods (lexical analysis and sentiment analysis) increases the predictive power.

Despite the sizable amount of research in emotion and in text analysis, we are not aware of many studies that not only compared (e.g. Nielsen, 2011; Hutto and Gilbert, 2014) but also combined both word-list-based lexical analysis and sentiment analysis to predict subjective ratings of emotional valence in short texts (e.g. Dhaoui et al., 2017).
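The emoticon-based weak labelling behind the distant-supervised pre-training described in Section 2.2 can be sketched as follows; the emoticon sets and the `weak_label` helper are illustrative assumptions, not the authors' code:

```python
# Distant supervision: emoticons act as noisy sentiment labels and are then
# removed from the text, so the model cannot simply memorize them.
POSITIVE = {":)", ":-)", ":D", "<3"}
NEGATIVE = {":(", ":-(", ":'("}

def weak_label(tweet):
    """Return (cleaned_text, label); label is 'positive', 'negative', or
    None when there is no (or contradictory) emoticon evidence."""
    tokens = tweet.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither found, or both -> unusable as a label
        return tweet, None
    cleaned = " ".join(t for t in tokens if t not in POSITIVE | NEGATIVE)
    return cleaned, "positive" if has_pos else "negative"

print(weak_label("so ein schöner Tag :)"))  # ('so ein schöner Tag', 'positive')
```

Tweets labelled this way form the large, weakly labelled pre-training set; a much smaller, manually labelled corpus (here SB10k) is then used for the supervised fine-tuning step.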
3 Methods

3.1 Samples and measurements

The 63 analysed texts originated from a collection of 102 German texts written by 32 authors, 21 of whom were German-speaking students and staff of a university in Germany (mean age 26.10, SD = 10.65; gender: 85% women), and 11 of whom were recruited by the first author (university staff and people recruited via social media and personal contacts; mean age 36.82, SD = 15.78; 64% women). These 63 texts are part of an international database with over 200 emotional short stories which are developed and refined within the framework of the COST initiative E-Read IS 1404 (Kaakinen et al., in preparation). The international database contains stories from Finland, France, Germany, Portugal, Spain, Switzerland, and Turkey. All stories are subjectively rated on emotional valence, emotional arousal and comprehensibility in their original language and in English. All texts have a length of 900 to 1100 characters including spaces. Texts that were not written in the first person were rewritten without changing their content and structure. The topic varies from story to story; some of them tell of joyful events and experiences (e.g. birth, love, music), others of negative ones (e.g. death, abuse). A few stories are emotionally neutral, i.e. neither positive nor negative and with a medium level of emotional arousal. The stories are mostly easy to understand. Once finished, the database will be presented in a publication and made freely accessible.

The subjective valence rating of the texts was conducted with the Self-Assessment-Manikin scale (SAM (3)) by Lang (1980). We used a modified 9-point scale by Suk (2006). Participants were instructed to rate the texts by choosing one of nine icons to represent their current emotional state. The 63 texts were rated on the survey platform Qualtrics by 55 native German-speaking university students from different majors of a German university. The raters' mean age was 23.47 years (SD = 2.62); 90.9% were female. Each participant rated a randomly predetermined set of 21 texts in randomized order, so that each text was evaluated by one of three groups of 17-19 participants each. As compensation, participants had the chance to win one of fifteen 10 € Amazon vouchers. The inter-rater reliability for the subjective rating of emotional valence, calculated with the R package irr by Gamer et al. (2012), was .98 or higher in each of the three groups.

3 cf. http://irtel.uni-mannheim.de/pxlab/demos/index_SAM.html accessed Feb. 2019

The semantic lexical analysis of the texts was conducted with the revised form of the Berlin Affective Word List, BAWL-R (Vo et al., 2009), in R (R Core Team, 2017), using the packages tidyverse (Wickham, 2017) and sylly (Michalke, 2018). In that list, valence had been rated on a 7-point Likert scale (-3 very negative through 0 neutral to +3 very positive). For each short story, we averaged the valence of all the words in that text that are represented in the BAWL-R.

The automatic sentiment analysis was trained on sentences. Nevertheless, we applied it to our short stories as a whole, since the subjective ratings we wanted to predict were on a text rather than a sentence level. For this paper, we calculated a new overall valence variable for the sentiment analysis data based on the negative and positive scores (negative sentiment minus positive sentiment), assuming that the neutral sentiment had no influence on the positive or negative orientation of the analysis. The reason for this decision was that the three original variables sum up to 1 and are therefore interdependent. Consequently, their individual effects on the subjective ratings canceled each other out. This new variable was created in order to obtain values comparable to the BAWL-R valence variable.

We further analyzed the texts in terms of readability. Readability was scored with the well-established Flesch Index (Flesch, 1948), using a formula adapted to the German language (Amstad, 1978). Means and standard deviations of all valence measures used are reported in Table 1.

Valence               mean    SD     min    max    scale
Subjective            4.46    2.21   1.17   8.61   1 - 9
Lexical               0.63    0.27   0.02   1.17   -3 - 3
Sentiment            -0.44    0.26  -0.95   0.20   -1 - 1
Sentiment positive    0.14    0.10   0.01   0.47   0 - 1
Sentiment negative    0.29    0.12   0.04   0.84   0 - 1

Table 1: Mean, standard deviation (SD), minimum (min), maximum (max), and possible values (scale) of valence measured subjectively (rated by students), with lexical analysis (BAWL-R), and with sentiment analysis
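Collapsing the three interdependent class confidences into the single overall valence variable described above can be sketched as follows; the example confidences are invented:

```python
# The sentiment model outputs one confidence per class, summing to 1.
# Following the difference rule described in the text, the overall valence
# variable is the difference of the negative and positive confidences;
# the neutral class is assumed not to shift the polarity.

def overall_valence(positive, negative, neutral):
    assert abs(positive + negative + neutral - 1.0) < 1e-6, "confidences must sum to 1"
    return negative - positive

# invented example confidences for one story
print(overall_valence(positive=0.14, negative=0.29, neutral=0.57))
```

Because the three confidences are constrained to sum to 1, only this single difference score (rather than the three raw scores) enters the regression models as a predictor.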
Figure 1: Convolutional Neural Network Model from Deriu et al. (2017)

3.2 Analyses

We chose to conduct our regression analyses with a Bayesian approach, which has important advantages over traditional frequentist null hypothesis significance testing. Within the Bayesian approach, the interpretation of data is not affected by sampling intention. In contrast to the frequentist approach, the Bayesian approach permits assessment of the relative credibility of parameter values given the data and the statistical model (Kruschke, 2010). The statistical analyses were conducted with R version 3.3.4 (R Core Team, 2017) and the R package brms version 2.4.0 (Bürkner, 2018), a package for Bayesian generalized multivariate non-linear multilevel models. To allow comparisons with other studies that correlated lexical or sentiment analysis with subjective ratings of texts, we calculated correlations of the standardized values averaged over the 63 texts as beta values with brms. For the multilevel models predicting subjective valence, the raw data of all 55 raters were included in the model with rater as a level 2 predictor. The resulting sample included 1143 observations, i.e. 63 texts with an average of 18 raters. The predictors (sentiment, lexical valence, and Flesch Index) were averaged for each of the 63 texts. The subjective valence ratings, an ordinal scaled variable with values ranging from 1 to 9, were modelled with a cumulative distribution. The Bayesian credible interval, meaning the range a certain value lies within with a probability of 95% (thus not to be confused with the frequentist confidence interval), is reported for all results. Since this is the first study in this context applying Bayesian analysis, no informative priors were available. We thus decided to use brms' default priors. The Leave-One-Out Cross-Validation information criterion (LOOic) was used to compare the different models. The LOOic is a method "for estimating pointwise out-of-sample prediction accuracy from a fitted Bayesian model using the log-likelihood evaluated at the posterior simulations of the parameter values" (p. 1413; Vehtari et al., 2017).

4 Results

The correlation between the sentiment value, calculated as the difference between negative and positive sentiment, and the lexical value of valence was r = .50 (95% credible interval, CrI = [.28; .72]). Both of them had a moderate positive correlation with the subjective valence ratings of the texts (r = .51, 95% CrI = [.28; .72], for sentiment; r = .62, 95% CrI = [.42; .82], for lexical valence). There was a weak correlation between the Flesch readability score and the other three variables (sentiment: r = -.24, 95% CrI = [-.48; .01]; lexical valence: r = -.12, 95% CrI = [-.37; .12]; subjective valence: r = -.24, 95% CrI = [-.50; .01]).

A visual inspection of the MCMC chains and the R-hat diagnostic, with all R-hat values < 1.02, revealed good convergence for all estimated parameters of all calculated models.

The restricted model (Model 0), including only the intercepts and the level 2 variable, had a LOOic of 4969. Model 1, predicting subjective valence by sentiment, had an effect of β = 4.60 (95% CrI = [2.62; 6.56]); the LOOic was 4717. Model 2, predicting subjective valence by lexical valence, had an effect of β = 5.37 (95% CrI = [3.62; 7.01]) with a LOOic of 4578. To decide which model to prefer, we relied on the credible intervals of the LOOic. The credible intervals of the LOOic of Models 1 and 2 did not overlap with that of the restricted Model 0 (see Table 2). This led to our conclusion that both models predicted subjective valence and that the first hypothesis could be confirmed.

Model   LOOic   se   CrI 5%   CrI 95%
M0      4969    17   4935     5002
M1      4717    36   4646     4787
M2      4578    39   4502     4654
M3      4515    42   4433     4597
M3+     4494    42   4412     4577

Table 2: LOO information criteria with standard error and credible intervals (CrI)

Model 3, predicting the subjective valence of the texts by sentiment and lexical valence (BAWL-R), is presented in Table 3. The design formula for Model 3 was formulated as follows:

R_i ~ Ordered(p)                                  [likelihood]
logit(p_k) = α_k - φ_i                            [cumulative link and linear model]
φ_i = β_BAWL * BAWL_i + β_Sent * Sent_i           [linear model]
α_k ~ Normal(0, 10)                               [common prior for each intercept]
β_BAWL ~ Normal(0, 10)                            [β_BAWL prior]
β_Sent ~ Normal(0, 10)                            [β_Sent prior]

R_i follows the ordered distribution (i.e. a categorical distribution that takes the vector p = {p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8} of probabilities of each subjective valence rating value below the maximum category of 9). α_k is the unique intercept of each possible outcome value k, φ_i is the linear model that is subtracted from each intercept, β_BAWL and β_Sent are the slopes of the BAWL-R (lexical analysis) and sentiment values respectively, and BAWL_i and Sent_i are the values of the two predictor variables on row i.
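The cumulative-link ("ordered logit") likelihood in the design formula can be sketched numerically as follows; the cutpoint and predictor values are invented for illustration:

```python
# Ordered-logit likelihood: K - 1 sorted cutpoints alpha_k and a linear
# predictor phi define P(R = k) for the K = 9 ordinal rating categories.
import math

def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def ordered_probs(alphas, phi):
    """Category probabilities P(R = 1) .. P(R = K) from cumulative
    probabilities P(R <= k) = inv_logit(alpha_k - phi)."""
    cum = [inv_logit(a - phi) for a in alphas] + [1.0]
    return [cum[0]] + [cum[k] - cum[k - 1] for k in range(1, len(cum))]

# 9 categories -> 8 cutpoints (invented values);
# phi = beta_BAWL * BAWL_i + beta_Sent * Sent_i for one text
alphas = [-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
phi = 3.42 * 0.2 + 2.07 * 0.1
probs = ordered_probs(alphas, phi)
print(round(sum(probs), 6))  # a proper distribution over the 9 ratings
```

A positive φ shifts probability mass toward higher rating categories, which is how positive lexical and sentiment values translate into higher predicted subjective valence.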
Parameter          R̂      n_eff   β      CrI 5%   CrI 95%
b_raterIntercept   1.01    1242    0.11   0.01     0.28
b_sentiment        1.00    4000    2.07   1.58     2.59
b_valence          1.00    4000    3.42   2.96     3.89

Table 3: Results of the Bayesian linear regression analysis

The subjective valence was predicted by sentiment with β = 2.07 (95% CrI = [1.58; 2.59]) and by lexical valence with β = 3.42 (95% CrI = [2.96; 3.89]). The LOO information criterion of Model 3 (LOOic = 4515) was smaller than that of either of the other models. The credible interval of Model 1, but not of Model 2, does not overlap with the credible interval of the combined Model 3. We conclude that sentiment and lexical analysis together predict subjective valence better than sentiment analysis alone. Even though the credible interval of Model 3 overlaps with that of Model 2, we consider the difference of their LOOic big enough to conclude that the prediction of subjective valence is better when sentiment and lexical analysis are combined than with either one of them on its own. This confirmed the second hypothesis. The level 2 predictor (text raters) in Model 3 had a small but negligible effect on subjective valence.

The integration of readability (Flesch Index) did not improve the model. The credible intervals were mostly overlapping and the information gain was negligible, with a small effect of readability on subjective valence (β = -0.04; 95% CrI = [-0.09; -0.02]; LOOic = 4494). The LOOic, standard errors, and credible intervals of the five models are listed in Table 2.

Figure 2 visualizes the prediction of subjective valence by sentiment and lexical valence given the data and the statistical model, considering the effects of both predictors (Model 3). The figure shows the slope (blue line) with its gray-shadowed 95% credible interval. Both predictors have a tight credible interval that does not include 0, indicating clear positive effects for both predictors.

Figure 2: Prediction of subjective valence by sentiment and lexical valence

5 Discussion

The aim of this study was to compare different techniques to capture emotions in 63 short texts. Our investigations focused on the question whether the prediction of readers' subjectively appraised emotions towards texts improves when word-list-based lexical analysis and sentiment analysis are both considered. The results confirmed our hypotheses that lexical and sentiment analyses both predict subjective valence independently or in combination (hypothesis 1). The strongest effect resulted when both approaches were combined (hypothesis 2). This confirms Dhaoui et al. (2017) and corresponds to the findings of Aman (2007) and Strapparava and Mihalcea (2010) that combinations of algorithms result in better predictions.

Other studies predicting subjective valence ratings of texts with the same word list (BAWL-R) found correlations of r = .58 (Hsu et al., 2015; for short passages of the Harry Potter novels) and r = .65 (Ullrich et al., 2017; for poems of Enzensberger), which are in the same range as our results (r = .62). Studies using other word lists (e.g. LIWC), for instance Settanni and Marengo (2015), found lower correlations between negative emotions expressed in Facebook posts and the corresponding subjective negative emotions (r = .22; for younger people r = .40). The same holds for the sentiment analysis, where our result (r = .50) corresponds to results published in the literature. Correlations of different algorithms with subjective valence ratings were reported, for instance, by Strapparava and Mihalcea (2010) for detecting sentiment in headlines. The algorithm with the best predictive power was the CLaC system, which "relies on a knowledge-based domain-independent unsupervised approach to headline valence detection and scoring. The system uses three main kinds of knowledge: a list of sentiment-bearing words, a list of valence shifters and a set of rules that define the scope and the result of the combination of sentiment-bearing words and valence shifters" (p. 28). This algorithm found a correlation of r = .48 for valence. The correlations of the other four algorithms were all below r = .40. In comparison to these studies, the sentiment analysis used in this study revealed a rather high correlation. A more recent study (Preoţiuc-Pietro et al., 2016) found a higher correlation of r = .65 between sentiment analysis and subjective valence with a bag-of-words linear regression model.

When both measurement techniques were combined in one model, the effect of lexical valence predicting subjective valence was stronger than the effect of the sentiment analysis. The β = 2.07 in Model 3 for sentiment means that an increase of 1 SD in the sentiment values corresponds to an increase of 2.07 SDs in the predicted subjective valence. Likewise, the β = 3.42 for lexical valence means that an increase of 1 SD in the lexical valence values corresponds to an increase of 3.42 SDs in the predicted subjective valence. This stronger effect of lexical analysis was also visible in the correlations of each variable with subjective valence. Both predictors correlate with each other (r = .50) and therefore share a good part of their variance. This explains the overlap between the credible intervals of Model 3 (sentiment and lexical analysis as predictors) and Model 2 (lexical analysis as predictor). Nevertheless, we considered the information gain of Model 3 over Model 2 to be large enough and therefore favour Model 3.

It is known from the literature that the difficulty of texts has an impact on the emotions when reading (e.g. Yin et al., 2014; Ben-David et al., 2016). To take this into account we investigated an additional model, Model 3+. One way to determine text difficulty is the Flesch Index (Flesch, 1948; Amstad, 1978). When this predictor was taken into account in the model, only a very small effect could be found. Therefore, we decided not to pursue this additional variant any further. One reason for the small effect of the Flesch Index might be that, when selecting the texts at the beginning of the study, we made sure that none of the texts used had extreme Flesch values, in order to avoid biases of the measurement results due to comprehension problems.
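The German adaptation of the Flesch reading-ease score used for this readability predictor can be sketched as follows; the constants are Amstad's (1978) variant as commonly reported, and the example counts are invented:

```python
# Flesch reading ease, German adaptation (Amstad, 1978):
#   score = 180 - ASL - 58.5 * ASW
# where ASL = average sentence length in words and ASW = average number of
# syllables per word. Higher scores mean easier texts. Syllable counts are
# passed in directly here; in the study they came from hyphenation tooling
# (the sylly package in R).

def flesch_amstad(n_words, n_sentences, n_syllables):
    asl = n_words / n_sentences    # average sentence length
    asw = n_syllables / n_words    # average syllables per word
    return 180.0 - asl - 58.5 * asw

# invented example: 100 words, 8 sentences, 160 syllables
print(round(flesch_amstad(100, 8, 160), 1))  # 73.9
```

Longer sentences and longer words both lower the score, which is why texts with extreme values were excluded from the corpus in the first place.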
An explanation for the weaker performance of the sentiment analysis compared to the lexical rating may be that the 63 analysed texts were part of an international database of emotional short texts (Kaakinen et al., in preparation). In this context, the emotional content of the entire text (not the word or sentence level) was assessed by student raters. This differs from the method of sentiment analysis, which was applied to each short story but was trained on Twitter messages, i.e. on the sentence level (Cieliebak et al., 2017). As in other studies, the lexical analysis is based on the average valence of words that were previously evaluated by students (Vo et al., 2009). We assigned each word of the short texts that was also included in the Berlin Affective Word List its valence value and averaged these values, obtaining a mean value for each short text. We assume that, due to the different aspects of valence measured by the procedures mentioned, the combination model, and therefore the combination of the different aspects of valence measured, achieved the best prediction values. However, in order to actually confirm this assumption we need to further investigate whether the correlations between the three measurements and the fit of the different models remain at a lower taxonomic level, i.e.
Another question is whether a combination of measurement techniques that were developed and validated in a context other than short stories, such as the sentiment analysis trained on tweets, is appropriate, or whether it is better to use techniques developed in the same context. We are under the impression that the sentiment analysis applied in this study did a rather good job compared to other procedures.

There are some aspects of our approach that we did not account for in this study and that may be worth exploring in future work. One such aspect is the perceived difficulty of the rating task, since the subjective ratings may be biased if the task is thought to be either particularly easy or particularly difficult. This concerns both the text ratings and the ratings underlying the two analysis approaches, which rely on subjective expert ratings at their very core. Another aspect worthy of inspection is the discrepancy between the human ratings and the two analysis approaches, since they do not necessarily align at all times. Exploring under which circumstances they diverge may prove to be a promising venture.

6 Conclusions

The results indicate that lexical and sentiment analyses predict the subjective appraisal of emotions triggered by short texts. The two methods are not redundant; it is therefore worthwhile to analyse the emotional potential of texts with both measurement procedures. A next step is to repeat these analyses at the sentence and word level to check whether we obtain even stronger predictive power. We also need to examine the integration of other text properties, including other semantic parameters, into our analysis, as done by Jacobs and Kinder (2017). The small effect gain of the Flesch-Index can be interpreted as an indication that non-emotional text properties could play a role in the perception of emotions in a text.

These results can, for example, contribute to the development of new instructional designs that measure the emotional appraisals of students engaged in digital learning tasks. Positive or negative emotions in students' texts or online contributions can be assessed automatically and transferred into instructional measures, thus helping to integrate automated learning support into feedback, hints or adaptive instructional design. For a useful integration of such sensors, i.e. measurements of emotional or affective properties of texts in digital learning, into educational practice, we need even more predictive power. From our point of view, this can be achieved by combining different measurement methods.

Acknowledgments

A big 'Thank you' to Yvonne Kammerer of the Leibniz-Institut für Wissensmedien in Tübingen, who organized the collection of the texts from Germany and the rating of the 63 texts gathered as part of a project in the COST Action E-READ. We also thank Mark Cieliebak and Jan Deriu for providing the sentiment prediction system and for helpful discussions, and Stéphanie McGarrity for proofreading and useful suggestions.

References

Saima Aman. 2007. Recognizing emotions in text. Ph.D. thesis, University of Ottawa, Canada.

T. Amstad. 1978. Wie verständlich sind unsere Zeitungen? [How understandable are our newspapers?]. Unpublished doctoral dissertation, University of Zürich, Switzerland.

Giulio Angiani, Laura Ferrari, Tomaso Fontanini, Paolo Fornacciari, Eleonora Iotti, Federico Magliani, and Stefano Manicardi. 2016. A comparison between preprocessing techniques for sentiment analysis in Twitter. In KDWeb.

A. Aryani, M. Kraxenberger, S. Ullrich, A. M. Jacobs, and M. Conrad. 2015. Measuring the basic affective tone in poetry using phonological iconicity and subsyllabic salience. Psychol. Aesthet. Creat. Arts.

R. Harald Baayen, Richard Piepenbrock, and H. van Rijn. 1993. The CELEX lexical database (CD-ROM). Linguistic Data Consortium, Philadelphia, PA: University of Pennsylvania.

Lisa Feldman Barrett and James A. Russell. 1999. The structure of current affect: Controversies and emerging consensus. Current Directions in Psychological Science 8(1):10–14.

Boaz M. Ben-David, Maroof I. Moral, Aravind K. Namasivayam, Hadas Erel, and Pascal H. H. M. van Lieshout. 2016. Linguistic and emotional-valence characteristics of reading passages for clinical use and research. Journal of Fluency Disorders 49:1–12.

Margaret M. Bradley and Peter J. Lang. 1994. Measuring emotion: The Self-Assessment Manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry 25(1):49–59.

Benny B. Briesemeister, Lars Kuchinke, and Arthur M. Jacobs. 2012. Emotional valence: A bipolar continuum or two independent dimensions? SAGE Open 2(4):2158244012466558.

Karl Bühler. 1934. Sprachtheorie [Language theory]. Stuttgart: G. Fischer.

Paul-Christian Bürkner. 2018. Advanced Bayesian multilevel modeling with the R package brms. The R Journal 10(1):395–411.

Mark Cieliebak, Jan Milan Deriu, Dominic Egger, and Fatih Uzdilli. 2017. A Twitter corpus and benchmark resources for German sentiment analysis. In Proceedings of the 5th International Workshop on Natural Language Processing for Social Media, Boston, MA, USA. Association for Computational Linguistics, pages 45–51.

Sanjiv Das and M. Chan. 2001. Extracting market sentiment from stock message boards. Asia Pacific Finance Association 2001.

Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. 2017. Leveraging large amounts of weakly supervised data for multi-language sentiment classification. In WWW 2017 – International World Wide Web Conference, Perth, Australia.

Chedia Dhaoui, Cynthia M. Webster, and Lay Peng Tan. 2017. Social media sentiment analysis: Lexicon versus machine learning. Journal of Consumer Marketing 34(6):480–488.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology 32(3):221.

Sigmund Freud. 1891. Zur Auffassung der Aphasien: Eine kritische Studie [On the interpretation of the aphasias: A critical study]. F. Deuticke.

Enrique Fueyo. 2018. Understanding what is behind sentiment analysis (part I). https://building.lang.ai/understanding-what-is-behind-sentiment-analysis-part-i-eaf1bcb43d2d. Last accessed 6 March 2019.

Matthias Gamer, Jim Lemon, A. Robinson, and W. Kendall. 2012. Package 'irr': Various coefficients of interrater reliability and agreement.

Chun-Ting Hsu, Arthur M. Jacobs, Francesca M. M. Citron, and Markus Conrad. 2015. The emotion potential of words and passages in reading Harry Potter – an fMRI study. Brain and Language 142:96–114.

Ashlee Humphreys and Rebecca Jen-Hui Wang. 2017. Automated text analysis for consumer research. Journal of Consumer Research 44(6):1274–1306.

Clayton J. Hutto and Eric Gilbert. 2014. VADER: A parsimonious rule-based model for sentiment analysis of social media text. In Eighth International AAAI Conference on Weblogs and Social Media.

Arthur M. Jacobs. 2015. Neurocognitive poetics: Methods and models for investigating the neuronal and cognitive-affective bases of literature reception. Frontiers in Human Neuroscience 9:186.

Arthur M. Jacobs and Annette Kinder. 2017. "The brain is the prisoner of thought": A machine-learning assisted quantitative narrative analysis of literary metaphors for use in neurocognitive poetics. Metaphor and Symbol 32(3):139–160.

Arthur M. Jacobs, Melissa L.-H. Võ, Benny B. Briesemeister, Markus Conrad, Markus J. Hofmann, Lars Kuchinke, Jana Lüdtke, and Mario Braun. 2015. 10 years of BAWLing into affective and aesthetic processes in reading: What are the echoes? Frontiers in Psychology 6:714.

Johanna K. Kaakinen, Egon Werlen, Yvonne Kammerer, S. Ruiz-Fernandez, Cengiz Acartürk, Xavier Aparicio, Thierry Baccino, Ugo Ballenghein, Per Bergamin, Nuria Castells Gomez, Armanda Costa, Isabel Falé, Olga Megalakaki, and M. Minguela. In preparation. Emotional text database [working title]. Behavior Research Methods.

Assaf Kron, Maryna Pilkiw, Jasmin Banaei, Ariel Goldstein, and Adam Keith Anderson. 2015. Are valence and arousal separable in emotional experience? Emotion 15(1):35.

John K. Kruschke. 2010. What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences 14(7):293–300.

P. J. Lang. 1980. Self-assessment manikin. Gainesville, FL: The Center for Research in Psychophysiology, University of Florida.

Moritz Lehne, Philipp Engel, Martin Rohrmeier, Winfried Menninghaus, Arthur M. Jacobs, and Stefan Koelsch. 2015. Reading a suspenseful literary text activates brain areas related to social cognition and predictive inference. PLoS One 10(5):e0124550.

Ana Carolina E. S. Lima, Leandro Nunes de Castro, and Juan M. Corchado. 2015. A polarity analysis framework for Twitter messages. Applied Mathematics and Computation 270:756–767.

Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies 5(1):1–167.

Tim Loughran and Bill McDonald. 2015. The use of word lists in textual analysis. Journal of Behavioral Finance 16(1):1–11.

Iris B. Mauss and Michael D. Robinson. 2009. Measures of emotion: A review. Cognition and Emotion 23(2):209–237.

Meik Michalke. 2018. sylly: Hyphenation and syllable counting for text analysis (Version 0.1-5). https://reaktanz.de/?c=hacking&s=sylly.

Kevin W. Mossholder, Randall P. Settoon, Stanley G. Harris, and Achilles A. Armenakis. 1995. Measuring emotion in open-ended survey responses: An application of textual data analysis. Journal of Management 21(2):335–355.

Tetsuya Nasukawa and Jeonghee Yi. 2003. Sentiment analysis: Capturing favorability using natural language processing. In Proceedings of the 2nd International Conference on Knowledge Capture. ACM, pages 70–77.

Finn Årup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903.

Catherine J. Norris, Jackie Gollan, Gary G. Berntson, and John T. Cacioppo. 2010. The current status of research on the structure of evaluative space. Biological Psychology 84(3):422–436.

Daniel Preoţiuc-Pietro, H. Andrew Schwartz, Gregory Park, Johannes Eichstaedt, Margaret Kern, Lyle Ungar, and Elisabeth Shulman. 2016. Modelling valence and arousal in Facebook posts. In Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 9–15.

R Core Team. 2017. R: A language and environment for statistical computing. http://www.r-project.org.

Michele Settanni and Davide Marengo. 2015. Sharing feelings online: Studying emotional well-being via automated text analysis of Facebook posts. Frontiers in Psychology 6:1045.

Antonette Shibani. 2017. Combining automated and peer feedback for effective learning design in writing practices. In ICCE 2017 – 25th International Conference on Computers in Education: Technology and Innovation: Computer-Based Educational Systems for the 21st Century, Doctoral Student Consortia Proceedings.

Vera Shuman, David Sander, and Klaus R. Scherer. 2013. Levels of valence. Frontiers in Psychology 4:261.

Carlo Strapparava and Rada Mihalcea. 2010. Annotating and identifying emotions in text. In Intelligent Information Access, Springer, pages 21–38.

Hyeon-Jeong Suk. 2006. Color and emotion – A study on the affective judgment across media and in relation to visual stimuli. Ph.D. thesis, Universität Mannheim.

Susann Ullrich, Arash Aryani, Maria Kraxenberger, Arthur M. Jacobs, and Markus Conrad. 2017. On the relation between the general affective meaning and the basic sublexical, lexical, and inter-lexical features of poetic texts – a case study using 57 poems of H. M. Enzensberger. Frontiers in Psychology 7:2073.

Aki Vehtari, Andrew Gelman, and Jonah Gabry. 2017. Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing 27(5):1413–1432.

Melissa L.-H. Võ, Markus Conrad, Lars Kuchinke, Karolina Urton, Markus J. Hofmann, and Arthur M. Jacobs. 2009. The Berlin Affective Word List Reloaded (BAWL-R). Behavior Research Methods 41(2):534–538.

Cynthia Whissell. 1996. Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities 30(3):257–265.

Cynthia Whissell. 2011. To those who feel rather than to those who think: Sound and emotion in Poe's poetry. International Journal of English and Literature 2(6):149–156.

Hadley Wickham. 2017. The tidyverse. R package version 1.1.1.

Wilhelm Wundt. 1896. Grundriss der Psychologie [Outlines of psychology]. Engelmann.

Guopeng Yin, Qingyuan Zhang, and Yimeng Li. 2014. Effects of emotional valence and arousal on consumer perceptions of online review helpfulness. In Twentieth Americas Conference on Information Systems, Savannah, USA.