Sentiment Below the Surface: Omissive and Evocative Strategies in Literature and Beyond

Pascale Feldkamp1, Ea Lindhardt Overgaard2, Kristoffer Nielbo1 and Yuri Bizzoni1
1 Center for Humanities Computing Aarhus, Jens Chr. Skous Vej 4, Building 1483, 8000 Aarhus C, Denmark
2 School of Communication and Culture, Jens Chr. Skous Vej 2, Building 1485, 8000 Aarhus C, Denmark

Abstract
As they represent one of the most complex forms of expression, literary texts continue to challenge Sentiment Analysis (SA) tools, which are often developed for other domains. At the same time, SA is becoming an increasingly central method in literary analysis itself, which raises the question of what challenges are inherent to literary SA. We address this question by probing units from a variety of literary fiction texts where humans and systems diverge in their valence scoring, seeking to relate such disagreements to semantic traits central to implicit sentiment evocation in literary theory. The contribution of this study is twofold. First, we present a corpus of valence-annotated fiction – English and Danish language literary texts from the 19th and 20th centuries – representing different genres. We then test whether sentences where humans and models disagree in sentiment annotation are characterized by specific semantic traits by looking at their distribution and correlation across four different corpora. We find that items where humans detected significant sentiment, but where models did not, consistently show lower levels of arousal, dominance and interoception, and higher levels of concreteness. Furthermore, we find that human-model disagreement correlates more with semantic aspects linked to the interiority-exteriority continuum than with direct sensory information. Finally, we show that this interaction of features linked to implicit sentiment varies across textual domains.
Our findings confirm that sentiment evocation exploits a more diverse and subtle set of semantic channels than those observed through simple sentiment analysis.

Keywords
sentiment expression, literary language, implicitness, objective correlative, sentiment analysis

CHR 2024: Computational Humanities Research Conference, December 4-6, Aarhus, Denmark.
pascale.moreira@cc.au.dk (P. Feldkamp); @cc.au.dk (E. L. Overgaard); kln@cas.au.dk (K. Nielbo); yuri.bizzoni@cc.au.dk (Y. Bizzoni)
ORCID: 0000-0002-2434-4268 (P. Feldkamp); 0000-0000-0000-0000 (E. L. Overgaard); 0000-0002-5116-5070 (K. Nielbo); 0000-0002-6981-7903 (Y. Bizzoni)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Sentiment Analysis (SA) is an increasingly central method for computational literary research [55], an especially popular application being that of gauging the 'sentiment arcs' of novels, i.e., the 'shapes of stories' [38, 54, 35]. Still, the relation between valence extracted with SA tools and the human perception of literary texts at a granular level remains an open question, as tools applied to the literary domain are primarily geared towards processing nonliterary texts. While some recent work has examined the adequacy of available SA tools for literary analysis [27, 11, 56], the question of how to validate them – against whose judgements, and at what level (i.e., the story-level vs. sentence-level, etc.) – is a persistent concern, also due to the relative scarcity of annotated resources in the literary domain. Moreover, the observed inadequacy of SA tools for literature has raised the question of what the difference of literary texts might be in comparison to the nonliterary [10], where tools seem to perform comparatively better [63]. How does the "literary" differ in its way of communicating sentiments to readers?
In fact, due to their textual complexity, literary texts are often said to differ from more communicative texts [34]: they are effective at multiple narrative levels [60, 12]; creatively divergent from standard language use [47, 5]; reliant on poetic devices [16]; and on ambiguity, effecting contesting interpretations [58]. Moreover, literary language may exhibit various strategies for conveying emotion beyond simply using words directly associated with emotional states (e.g., "sad"). While language on, e.g., social media may also rely on omission and subtlety, literary theorists have frequently claimed that literariness, or the poetic function of language, excels in this regard and is distinct from its more directly communicative function.1 Regarding sentiment expression, it has recently been suggested that literary prose relies on specific semantic traits connected to affective understatement and concreteness to evoke – rather than "communicate" – sentiment in readers [10].2 These features are especially interesting since they are related to the seminal concept of the "objective correlative" in literary theory, which suggests that literary texts effectively convey emotions by grounding them in – concrete and objective, rather than emotional and subjective – entities and situations [26], while avoiding abstract and emotional language.3 In this study, we further pursue the hypothesis that literary texts rely on omission and materiality, i.e., features connected to the concept of the objective correlative, to evoke sentiments, rather than simply mention them.

While divergences between human and model judgments in sentiment analysis are generally taken to indicate shortcomings in SA tools, they present an interesting case for testing the difference of literary language in expressing sentiment, assuming that SA tools are generally tuned toward more explicit forms of communication due to their development on, e.g.,
social media – also suggested by more general studies of implicitness [75, 40]. Using a single novel as their data, Bizzoni and Feldkamp [10] explored this discrepancy, finding that certain semantic elements – levels of arousal, dominance and concreteness – are indicative of human/model disagreement. To gain further insight into sentiment expression in literary texts, we go one step further: we test the distinguishing power of these features in a much larger corpus of annotated literary texts. We also extend the list of features by an additional four: imageability, visuality, hapticity, and interoception, in order to further examine the connection of human/model disagreement with features that relate to the concept of the objective correlative and of omissive strategies in literary sentiment expression.4

We conduct two experiments. First, we test whether sentences where models and humans agree can be distinguished from sentences where they disagree based on these specific semantic traits. Secondly, we further explore the robustness of the relation between features and sentiment evocation by testing the correlation of these features to the absolute difference in sentiment scoring between humans and models.

1 Jakobson distinguished the "poetic function" of language from its "emotive or expressive function", which "aims a direct expression of the speaker's attitude toward what he is speaking about" [34, p. 66].
2 Again, these phenomena naturally extend outside the literary domain [57]: tweets using irony or figurative language, e.g., likely effect diverging reader interpretations [61, 67].
3 Besides T.S. Eliot, who coined the term, other famous proponents of this view are the Imagists [52].
4 All data and code for the present study is available here.

2. Related works

2.1.
Literary sentiment analysis

While SA tools perform increasingly well in some domains [2, 63], studies have pointed out the cross-domain drop in performance, as well as the lag of tools for under-resourced languages [28, 49, 14]. Still, some studies have suggested that Transformer-based models might be able to bridge the gap and perform better on literary or poetic material [65]. Assessing the performance of models on historical Danish and Norwegian literary texts, Allaith, Degn, Conroy, Pedersen, Bjerring-Hansen, and Hershcovich [3] found that multilingual Transformers outperformed both fine-tuned models and classifiers based on lexical resources in the target language, which aligns with the findings of Schmidt, Dennerlein, and Wolff [65] and Schmidt and Burghardt [64] for historical German drama. Still, the disparity between human and model sentiment judgements in literary texts continues to be observed [11, 56]. The disparity is often related to the effects of narrative – annotators generally having access to the narrative context of a sentence – rather than to differing strategies in sentiment expression across domains.

2.2. Literary sentiment expression

The concept of "implicit" expression is particularly relevant, and complex, in literary writing. Several theories of writing point to the importance of avoiding stating concepts or ideas (however this is intended) in too explicit a way. The widely known precept of "Show Don't Tell" points at least partly in this direction [12]. Moreover, critics continually rely on terms like emotional "evocativeness" and "understatement" to describe writing styles [69, 24].
Literary and affect theory has also more recently foregrounded the use of materiality and sensuousness in literature – including poetry – to evoke affective reactions in readers, emphasizing the way objects and things are culturally invested with meaning and affect [1, 17] and thus utilized by authors to evoke embodied, affective experiences [51].5 Despite the significance of implicit, evocative, and expressive strategies, there is little consensus on how to reliably track these in literature and whether such types of expression have recognizable linguistic markers.

The association of materiality with literary evocation is not new, and closely relates to the Modernists' and New Critics' valuation of concreteness over abstraction [69], as well as to the idea of the 'objective correlative' which T.S. Eliot [26] proposed in 1948. Eliot suggested that the effective way of expressing emotion in literature is 'by proxy', through an external and objective – in the sense of intersubjectively recognizable – "set of objects, a situation, a chain of events which shall be the formula of [a] particular emotion" [26]. The concept of the objective correlative suggests that literary language effectively evokes sentiments in readers by being both more omissive (relying less on directly emotion-associated words)6 and more concrete (relying on objects and situations). This hypothesis has been supported by computational literary studies.

5 In narratology, this is close to what Fludernik has termed narrative "experientiality" [29]. Burroway [20] explicitly notes that when using nouns that evoke sense images and verbs that represent visualizable actions "the writing comes alive".
Auracher and Bosch [6] found that the concreteness of literary language impacts the emotional engagement of readers and their experience of suspense, and Bizzoni and Feldkamp [10] tracked the "omissive" writing of Ernest Hemingway by looking at the amount and intensity of sentiment expressions detectable in sentences, compared to how "expressive" (in terms of sentiment) readers perceive these sentences to be. Comparing sentences where humans and models agree vs. those where they disagree, they found that arousal and dominance levels (of the NRC-VAD lexicon [45]) were indicative of omissive strategies of evocation. Moreover, the level of concreteness of the language used appeared higher in sentences with higher disagreement between humans' and models' valence attribution.

2.3. Semantic traits of literariness

In computational linguistics, the use of semantic traits – often derived from psycholinguistic norms – for the study of narrative is relatively frequent [70, 36, 43]. Yet, their use to model poetic literary strategies of sentiment evocation is still relatively pioneering. Kao and Jurafsky [36] applied semantic measures to poetry: imageability to gauge "imagery" and concreteness to gauge "concrete imagery", using the concreteness ratings of Brysbaert, Warriner, and Kuperman [18], along with objective/abstract word categories [68], as well as psycholinguistic norms [50] to model "emotional language". They show that these features differ between "amateur" and appraised poets, where appraised poets use more concrete and imageable language and fewer emotional words. Conversely, Maslej, Mar, and Kuperman [43] find that abstraction and arousal correlate positively with the perceived interest of readers in fictional characters, which is perhaps related to the general tendency of abstract concepts to be more emotionally valenced than concrete ones [39].
Ullrich, Aryani, Kraxenberger, Jacobs, and Conrad [70] observe how the differences between perceived affect in poem annotations can be largely explained by lexical psycholinguistic norms, with sentiment dimensions like arousal being one of the best predictors of the perceived affective meaning of the poems. The studies show above all that imageability, concreteness and arousal strongly relate to how readers feel about literary texts.

6 The advice against "sentimentalism" and abstraction in literary language is present in Eliot, though more prominent in, e.g., Ezra Pound's literary criticism [52].

Table 1
Used datasets with valence annotation. For all but the FB dataset, valence was annotated on a sentence basis, so the number of annotations generally indicates a number of sentences. The Spearman correlation (ρ) between the human mean and RoBERTa scores (H/R) is provided (for all, p < .01). Summing up, the total number of annotated lines considered in this study is n = 19,327. Annotators/line indicates the number of annotators of valence reported per line in the corpus.

                 N. annotations   N. words   x̄ words/line   Annotators/line   H/R correlation
FB               2,895            57,436     19.8           2                 0.78
EmoBank          8,735            173,958    19.9           10                0.65
  Letters        1,344            25,550     19.0           10                0.69
  Blog           1,323            25,691     19.4           10                0.68
  Newspaper      1,308            31,647     24.2           10                0.62
  Essays         1,131            30,958     27.4           10                0.62
  Fiction        2,711            39,393     14.5           10                0.58
  Travel-guides  918              20,719     22.6           10                0.48
Fiction4         6,300            73,250     11.6           >2                0.64
  Hymns          2,026            12,798     6.3            2                 0.67
  Fairy tales    772              18,597     24.1           3                 0.57
  Prose          1,923            30,279     15.7           2                 0.59
  Poetry         1,579            11,576     7.3            3                 0.56

3. Data

3.1. Datasets

To probe textual features of implicit sentiment, we created a diverse corpus of fiction spanning four genres, manually annotated for valence (Fiction4).
For the second experiment of this paper – comparing more or less literary genres – we also selected other datasets to represent a diversity of genres in the literary and non-literary domains, which had all been annotated for valence on a continuous scale.7 As such, we include social media, journalism, and genres with various degrees of literariness (essays, letters, travel-writing, fairytales), which may also be thought to represent degrees of colloquialism (from blogs to journalism).8

Fiction4: The new dataset presented in this study, with human annotations for valence (n = 6,300).9 It includes four different genres – fairy tales, hymns, prose and poetry – over the 19th and 20th century, in one high- and one low-resource language (English and Danish). The corpus was compiled with an aim toward diversity (in genre, time and language) while still aiming to include well-known and culturally significant works by both male and female authors.10 We also took into account both i) the texts' cultural significance and ii) their narrative and poetic complexity, which may represent a particular challenge to SA tools. The authors selected were Ernest Hemingway, Sylvia Plath, H.C. Andersen and the various authors of official hymn-books (for details, see Appendix, Table 7).

7 We have adopted a very broad understanding of genre for this paper, encompassing more or less literary genres.
8 We standardized the varying valence scales of the chosen corpora to a scale from -1 to 1 (negative to positive).
9 For an overview table of this corpus, see Appendix.
10 Note, however, that the corpus is highly skewed towards male writers, not least because of the time-period and genres covered (i.e., hymns). For a detailed overview of the corpus, see Table 7 in Appendix.
i) Regarding their cultural significance, for the English texts, Hemingway is perhaps one of the most famous 20th century authors for his prose, and his texts are read in education11 and among the public.12 Plath, similarly, is a widely read and acclaimed poet, perhaps the best known female American poet of the 20th century.13 For the Danish texts, Andersen's production is arguably the most central in Danish literary heritage [59]. While less known internationally, the official hymnal book is the most widely distributed "poetry book" in Denmark [62],14 is used in the Danish education system at all levels, and shapes national cultural identity [9].

ii) Regarding the complexity of the corpus for the sentiment annotation task, we ensured variance across genre, place and time, but also emphasized, from a literary perspective, the level of literary complexity, including texts that could be considered either particularly challenging or particularly simple. We consider Hemingway's The Old Man and the Sea and Plath's Ariel as two ideally difficult cases for testing SA tools. Hemingway is known for an especially "omissive" writing style, direct and limited in its use of figurative language [31], while relying on implication rather than "overt emotional display", leaving much inference up to the reader [69]. Hemingway's The Old Man and the Sea (1952) has been considered emblematic of this minimalist style, which may omit characteristics that models rely on in sentiment scoring. Plath's poetry collection Ariel (1965) is complex in a slightly different way.15 The so-called confessional poetry genre, of which Plath is considered emblematic, foregrounds idiosyncratic personal psychology and experiences against the "emotional vacuity of public language" and universal symbols [46].
In Ariel, Plath writes on complex and political themes in an idiosyncratic style consisting of "hallucinatory images" and novel metaphors [15], and the work has been used as a case of literature posing particular difficulty to most readers [25]. Conversely, we consider Andersen and religious hymns two ideally simple cases for testing SA tools. Andersen's fairy tales16 are characterized by an essential simplicity, both stylistically and in their narrative progression [41, 4], and by their ability to engage both children and adult readers [41]. Religious hymns17 are characterized by their limited number of themes (e.g. worship, thanksgiving, etc.), which are expressed through well-known and formalized (as well as recurring) phrases, metaphors, figurative and symbolic language [48]. The hymns' repetitive and predictable use of language may make them more accessible to models, even though their archaic and nuanced style may present challenges. After collecting the texts, we found that some of these simplicity/complexity assumptions are

11 The Old Man and the Sea is studied in schools across the world [44].
12 As of today, The Old Man and the Sea has over 1 million ratings on GoodReads.
13 Plath's prose work The Bell Jar has around 1 million ratings on GoodReads, and her poems appear among the top 250 assigned works on English Literature college syllabi.
14 Note that the Danish term used, "lyrik", encompasses poetry and songs.
15 Fiction4 includes all 40 poems in Ariel.
16 Fiction4 includes three of Andersen's best-known fairy tales: "The Little Mermaid" (1837), "The Ugly Duckling" (1844), and "The Shadow" (1847), in an edition where spelling has been slightly modernized [21].
17 Fiction4 includes 65 hymns from three different official hymnal books from the years 1798 (n = 35), 1857 (n = 17), and 1873 (n = 13). Years refer to publication years of three official church hymn collections, and hymns were collected at random.
reflected in the correlation between human and model valence scores – at least when using our method – where Plath's poetry shows the lowest correlation, and the hymns the highest (Table 1).

Reference corpora and datasets:

EmoBank: EmoBank is a multi-genre corpus with human annotations for valence (n = 8,735),18 with 10 annotators per sentence [19].19 The corpus was composed from various categories in the Manually Annotated Sub-Corpus of the American National Corpus (MASC),20 consisting of texts from 1990 and onwards [33]. We consider the EmoBank categories Letters, Blog, Newspaper, Essays, Fiction, and Travel guides, which are relatively balanced (Table 1), including both longer and shorter texts within each category.21

FB: The Facebook corpus of posts (FB),22 collected between 2009 and 2011, consists of 2,895 status updates, each by a unique user, with human annotations for valence and arousal, with 2 annotators per post [53]. The FB dataset differs from our other corpora in consisting of posts – not sentences. While some posts are short (e.g., ":)" and "LOL"), the average length of posts is comparable to the average sentence length in, e.g., EmoBank (Table 1).

Beyond these corpora, we also include two datasets without valence annotation for comparison in terms of feature levels:

Image-captions (n = 3,334,173) of the Conceptual Captions dataset [66], of which we consider feature values to represent a "high-water mark" (i.e. high level) of language relying on object description and visuality.23

Participants' free emotion event descriptions (n = 6,898)24 of the International Survey on Emotion Antecedents and Reactions (ISEAR) [71],25 of which we consider feature values to represent a "high-water mark" of language dealing with interiority (i.e., relating to inside sensations and self) and emotionality.

18 We exclude the heterogeneous SemEval category, as well as very short strings of sentences (noise) across the categories (length < 2).
19 5 annotators annotated each sentence for valence from a "reader" and a "writer" perspective, i.e., 10 valence annotations per sentence. The valence scores represent a weighted average. See the documentation.
20 Which is in turn a subset of the American National Corpus.
21 The category 'essays', for example, comprises 8 texts, including the essay "A Brief History of Steel in Northeastern Ohio" or one on discrimination. 'Fiction' comprises 6 works of various genres, e.g., Richard Harding's "A Wasted Day" and the SciFi story "Captured Moments". Newspapers include various short reports (e.g. "A.L. Williams Corp. was merged into Primerica Corp." etc.) and longer reportages. Note that Travel Guides are generally written in running prose, and include both place-histories (e.g. "A brief history of Jerusalem") and current-day reflections (e.g. "Dublin and the Dubliners"). See the full MASC corpus here.
22 https://github.com/wwbp/additional_data_sets/
23 https://github.com/google-research-datasets/conceptual-captions
24 To exclude noise and non-answers (e.g., "I cannot remember"), we set an arbitrary threshold of 30 tokens for a description to be included, resulting in a diminished dataset from the original 7,659 datapoints.
25 https://github.com/sinmaniphel/py_isear_dataset/

4. Methods

4.1. Human and automatic sentiment annotation

Model annotation: Multilingual transformer-based models have shown the best performance in SA for literary texts across languages (also for historical texts) [11, 3, 65], compared to dictionary-based approaches explicitly developed for literary texts as well as monolingual English models [11].
We therefore used the multilingual XLM-RoBERTa base model, fine-tuned for sentiment analysis on Twitter data,26 which is comparable to the state-of-the-art models in a monolingual (non-literary) setting [7], and shows the best performance in the limited studies available on literary prose [11].27 XLM-RoBERTa28 was developed through a cross-lingual language training method, designed to boost its proficiency in comprehending and processing multiple languages by transferring skills acquired from one language to another. With this model, we scored all sentences across our bilingual datasets.29 The model returns a positive, negative, or neutral label. To attain more continuous, nuanced data from the transformer's categorical output, we opted for the same strategy as in Bizzoni and Feldkamp [11], i.e., using the confidence score of model labels as a proxy for sentiment intensity. For example, a sentence with a positive label and a confidence of, e.g., 0.75 is interpreted as a valence score of +0.75. Similarly, a negative label with a confidence of 0.89 is interpreted as a valence score of -0.89. For the neutral category, confidence is disregarded and levelled to a score of 0 (midscale or "neutral").30

The correlation between the human mean score and the transformed RoBERTa score appears high across our selected corpora (Table 1). As seen in Table 1, RoBERTa values correlate most strongly with human annotations of the FB dataset, while correlations with annotations of fiction (EmoBank and Fiction4) are much lower, possibly reflecting the model's better development for certain more colloquial domains (social media, blogs, letters).

26 We used this model off-the-shelf, so the hyperparameters are as reported in: https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment/blob/main/config.json.
27 Note that recent studies have tested newer, generative models for literary SA, notably Rebora, Lehmann, Heumann, Ding, and Lauer [56].
For this study, we excluded GPTs from our pool of tools. As our interest is not achieving top performance, but rather understanding the differences between SA tools and human annotation, we sought to employ only models that were designed for sentiment analysis and that do not depend on prompt engineering.
28 https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta
29 For the Danish texts, we tried the model on the original Danish, on non-validated Google translations, and on manually checked and revised Google translations. We chose to use the model's output on validated translations, since those valence scores correlated best with human annotations – the correlation with human mean values when applying the model on hymns and fairy tales in the original Danish was ρ > .45, p < .01, on Google translations ρ > .61, p < .01, and on validated English translations ρ > .65, p < .01.
30 Naturally, there are caveats to transforming sentiment polarity to continuous valence scores in this way. However, the approach has been shown to outperform dictionary-based approaches (which output continuous scores by design) and to approximate a human continuous valence annotation in literary prose [11]. Note that the distribution of transformed scores still tends to "look polar", as confidence scores tend to be generally high, see Fig. 6, Appendix.

Figure 1: Group sizes: the full Fiction4, the filtered subset (where human valence was below 4.5 and above 5.5 on the 0-10 scale), as well as the implicit and explicit groups. Number of sentences on the x-axis.
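The label-to-valence transformation described above can be sketched in a few lines. This is a minimal illustration: the function name is ours, and in practice `label` and `confidence` would come from the XLM-RoBERTa classifier's output rather than be passed in by hand.

```python
def label_confidence_to_valence(label: str, confidence: float) -> float:
    """Map a categorical sentiment label and its confidence score to a
    continuous valence value in [-1, 1]: a positive label keeps the
    confidence as the score, a negative label flips its sign, and a
    neutral label is levelled to 0 regardless of confidence."""
    if label == "positive":
        return confidence
    if label == "negative":
        return -confidence
    return 0.0  # "neutral": confidence disregarded


# The two worked examples from the text:
assert label_confidence_to_valence("positive", 0.75) == 0.75
assert label_confidence_to_valence("negative", 0.89) == -0.89
```

The neutral case is what makes the transformed distribution "look polar": non-neutral confidences tend to be high, so scores cluster near the extremes and at 0.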
Human annotation of Fiction4: Human annotators (at least n = 2 per line) read the literary texts from beginning to end, scoring each line on a 0 to 10 valence scale:31 0 signifying the lowest, and 10 the highest valence.32 The valence score was intended to represent the sentiment the sentence or verse expressed, and annotators were instructed to avoid rating how a sentence or verse made them feel and to try to report only on the sentiments embedded in the sentence, i.e., to think about the valence of the individual sentence or verse, without overthinking the story's/poetry's narrative. It is worth noting that humans rarely reach an agreement higher than 80% (or 0.80 Krippendorff's α) for tasks like positive/neutral/negative discrete tagging [74] on nonliterary texts – and have lower agreement for continuous scale polarity annotation [8], especially for literary texts [56].

4.2. Sentence subsets

For our first experiment, exploring the prevalence of semantic traits in sentences where humans and the model disagree and sentences where they agree, we divided our Fiction4 corpus into two groups. First, we filtered out sentences in which humans did not perceive any strong sentiment (i.e., with human valence scores between 4.5 and 5.5 on the 0-10 scale). On one hand, we then took sentences in which our chosen model did not assign any strong sentiment (below an absolute score of 0.1, i.e., between -0.1 and +0.1)33 and, on the other hand, sentences where it did. With this procedure, we distilled two groups of sentences (Fig. 1): one of sentences with human/model disagreement, which we call the "implicit" group (n = 1,194), and one of human/model agreement, which we call the "explicit" group (n = 2,631).

4.3.
Features

Based on previous work (section 2), we include three previously used semantic traits to examine their bearing on instances of implicit sentiment evocation: on the sentiment dimension, arousal34 and dominance,35 and on the sensorimotor dimension, concreteness;36 to these we add imageability, as well as three additional sensory traits: visuality,37 hapticity,38 and interoception.39 We use the datasets below to measure sentence semantic trait values, averaging the score per feature for each sentence.

31 "Lines" refer to sentences in the case of prose and to verse-lines in the case of the hymns/poetry. Sentences were tokenized using the nltk tokenize package.
32 Annotators were researchers, three with a background in literary studies and one in cognitive science. The two annotators of the hymns (MA and PhD of literature) had domain knowledge in 19th century Scandinavian literature and historical religious hymns.
33 Note that the model valences range from -1 to 1 (negative to positive), where 0 represents neutral.

Concreteness lexicon: The lexicon by Brysbaert, Warriner, and Kuperman [18] provides concreteness ratings for 37,058 English words. Annotators were recruited via MTurk (English native speakers). Each word was annotated by at least 25 annotators, on a scale from 1 (= most abstract, i.e., what cannot be experienced directly but the meaning of which is defined by other words) to 5 (= most concrete, i.e., what can be experienced directly through one of the five senses). These ratings have been widely used [22], also in the literary domain [6].

NRC-VAD lexicon: The lexicon by Mohammad [45] provides ratings of 20,000 English words on three sentiment dimensions (valence, arousal, dominance). Annotators were recruited via CrowdFlower, and each word was annotated by at least 6 annotators with a best/worst scaling approach (e.g. most arousal vs. least arousal). The lexicon has been used widely, and has been integrated in the SA tool VADER [32].
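The per-sentence scoring described above – averaging a lexical feature over the words of a sentence – can be sketched as follows. The mini-lexicon and its values are invented for illustration only; the real resources cover tens of thousands of words.

```python
from statistics import mean

# Hypothetical mini-lexicon in the style of the concreteness ratings
# (scale 1 = most abstract to 5 = most concrete); values invented.
CONCRETENESS = {"sea": 4.7, "man": 4.6, "boat": 4.8, "hope": 1.5}

def sentence_feature_score(tokens, lexicon):
    """Average a lexical feature over the tokens of a sentence,
    skipping out-of-lexicon tokens; returns None when no token is
    covered (the NaN case dropped before statistical testing)."""
    rated = [lexicon[t] for t in tokens if t in lexicon]
    return mean(rated) if rated else None
```

A sentence like "the old man and the sea" would then receive the mean of the ratings of its covered tokens, while a sentence with no lexicon hits yields None, matching the NaN drop-outs excluded before the group comparisons.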
The Lancaster Sensorimotor Norms: The dataset provides norms of sensorimotor strength for 39,707 English words across 6 perceptual modalities (haptic, auditory, olfactory, gustatory, visual, and interoceptive) [42]. While the dataset includes action effectors (i.e., body parts), we used only the general perceptual norms, selecting only those we deemed especially relevant to the idea of the objective correlative (i.e., material, visual objects/situations): visuality, interoception and hapticity. The perceptual part of the dataset had 2,625 annotators, recruited via MTurk. Each word was rated from 0 (= not experienced with sense X) to 5 (= experienced greatly with sense X). These perceptual modality ratings have been used in, e.g., metaphor detection [72], and have served as a form of "embodied experience" information to enrich and improve language models [37].

Imageability: The MRC Psycholinguistic Database (MRCPD) provides 26 linguistic and psycholinguistic variables for 150,837 English words – a subset of which are 9,240 words rated for imageability [23]. These words have been rated by human annotators; ratings reflect how easily a word can evoke mental imagery, and fall in the range 100-700. The lexicon has been used variously, e.g., in metaphor [30] and literary studies [36].

34 The degree to which a word prepares for action, captures or focuses attention [13].
35 The degree of control evoked [73].
36 The degree to which a word denotes a perceptible entity [18].
37 The degree to which a word is experienced with the eyes [42].
38 The degree to which a word is experienced by touch [42].
39 The degree to which a word is experienced by sensations inside the body [42].

Table 2
Inter-rater reliability between annotators across literary genres, using the mean (x̄) of Spearman's ρ between pairs (for all, p < .01) – with Krippendorff's α for reference.
                  Hymns  Fairy tales  Prose  Poetry
Spearman's ρ (x̄)  0.73   0.68         0.62   0.59
Krippendorff's α   0.72   0.69         0.64   0.59

Figure 2: Distribution of feature values for the two groups: implicit and explicit groups of sentences in the Fiction4 corpus.

5. Results
5.1. Human annotation
We report a relatively high inter-rater reliability (IRR): between annotators, we find a mean correlation (Spearman's ρ) from 0.59 for poetry to 0.73 for hymns (Table 2).40 IRR is high, especially for hymns, considering both the fragmentariness of the verses and the fact that humans tend to show low agreement in continuous-scale annotation (Section 4.1).
5.2. Experiment 1
For our first experiment, we compared the two groups of sentences in Fiction4 along each of the chosen features. We report the effect sizes and significance levels of the Mann-Whitney U-test in Table 3.41 We find that the strongest effect size is for interoceptive values (Table 3), while, visually, concreteness shows two notable "peaks" (Fig. 2).42 Overall, we can confirm the difference
40 As annotators operated within a continuous valence spectrum, divided into ten categories, we find that a correlation measure more clearly reflects the direction and nuance of annotations (parallelity vs. exactness), compared to categorical IRR measures. We therefore report Spearman's ρ and provide Krippendorff's α for reference (the level of measurement is considered interval).
41 For the test, we dropped sentences with NaN values in the specific feature we were testing; the number of dropouts was < 40 in each test.
42 The results of the Mann-Whitney U-test are supported by a linear regression, where we sought to model the two groups by each feature. Significant results of the linear regression correspond to those indicated by the Mann-Whitney U-test; see Table 8 in the Appendix.
Table 3
Mann-Whitney U test on the implicit vs. explicit group of sentences (n = 1,194/2,631) in the Fiction4 corpus.
Note that, for better readability, we have divided the effect size by 100 here. *p < .01; numbers in grey are p > .05.

           Concret.  Arousal  Dominance  Imag.   Visual  Haptic  Interocept.
Fiction4   1,278*    1,807*   1,636*     1,538   1,413*  1,461*  1,975*

Table 4
Mann-Whitney U test on the implicit vs. explicit group of sentences in the reference corpora. Again, the effect size has been divided by 100 for readability. *p < .01; numbers in grey are p > .05.

              Concret.  Arousal  Dominance  Imag.   Visual  Haptic  Interocept.
EmoBank       2,205*    3,320*   2,940*     2,429*  2,528*  2,745   3,380*
Letters       759*      1,142*   1,125*     797*    903     1,034   1,194*
Fiction       2,137*    2,959*   2,362      2,223   2,312*  2,348   2,865*
Blog          414*      658*     613*       475     473*    500     661*
Newspaper     464*      685*     591        542     547     607     762*
Essays        222       350*     244        245     246     232     303*
Travelguides  268*      434*     448*       307*    314*    385     489*
FB            1,264     1,501*   1,205      1,327   1,333   1,373   1,494*

between groups for the three features previously tested [10], while adding the observation of slight differences also for language heavy in visual and haptic information, as well as a robust difference for interoceptive information in the implicit group. For reference, we conducted the same experiment on our reference corpora, EmoBank and FB, dividing the data into implicit and explicit groups of sentences as outlined in Section 4.2. These results are reported in Table 4. Histograms supporting this difference in feature values between groups in EmoBank can be found in the Appendix, Fig. 7 – as in Fiction4, levels of arousal, dominance, and interoception are lower in the implicit group, while concreteness is higher. In the reference corpora (Table 4), the strongest effect size tends, as in Fiction4, to be for interoceptive values. Notably, interoceptive and arousal values hold significant discriminating power across all corpora, as does concreteness if we disregard the FB corpus.
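The group comparison behind Tables 3-4 can be illustrated with a minimal, dependency-free sketch of the Mann-Whitney U statistic (the paper reports U divided by 100 as effect size; in practice one would use a library routine such as scipy.stats.mannwhitneyu to also obtain p-values). The trait values below are invented for illustration, not taken from the paper's data.

```python
def mann_whitney_u(a: list[float], b: list[float]) -> float:
    """U statistic for group a: number of (a, b) pairs where the a-value
    exceeds the b-value, with ties counting one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

# Hypothetical concreteness scores, NaN sentences already dropped:
implicit = [3.1, 2.9, 3.4, 3.0, 3.3]
explicit = [2.5, 2.8, 2.6, 2.7, 2.4]
u = mann_whitney_u(implicit, explicit)  # complete separation: U = len(a) * len(b) = 25
```

Because every implicit value here exceeds every explicit value, U reaches its maximum; real feature distributions overlap, which is why the paper pairs the statistic with significance levels.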
We see important differences within EmoBank, where all features appear important only in the more personal or imaginative genres, and not in newspapers and essays.
5.3. Experiment 2
In the second experiment, we check for correlations (rather than simple statistical difference) between human/model disagreement and the level of each semantic trait per sentence. This allows us to observe whether the presence of "undetected sentiment" in text has a linear relation with any of the selected semantic dimensions. For this, we used only sentences in the implicit group (as outlined in Section 4.2), so that we correlated the amount of human/model disagreement with our chosen features in sentences where humans perceived sentiment, but models did not.
Table 5
Spearman's ρ between disagreement (absolute human/RoBERTa score difference) and feature score in the FB, EmoBank, and Fiction4 corpora. Note that for these correlations, we have filtered out sentences shorter than five words. For all correlations in black: p < .05; with *: p < .01.

              Concret.  Arousal  Dominance  Imageab.  Visual  Haptic  Interocept.
FB            0.03      -0.04    0.17*      -0.05     -0.02   -0.02   0.03
EmoBank       0.06*     0.01     0.05*      0.03      0.09*   0.02    -0.10*
Letters       0.01      0.02     0.06       0.02      0.02    -0.04   -0.03
Blog          0.16*     -0.12*   -0.07      0.06      0.16*   0.10    0.07
Newspaper     -0.03     0.02     0.06       0.02      0.09    0.08    0.01
Essays        -0.13     -0.01    0.16*      -0.15*    -0.05   -0.03   -0.09
Fiction       0.04      -0.08    -0.06      0.01      0.06    0.02    -0.12*
Travelguides  0.17*     0.08     0.03       0.15*     0.03    0.12*   0.01
Fiction4      -0.05     -0.09*   0.12*      0.01      0.01    -0.05   0.05
Hymns         -0.02     -0.05    0.22*      0.06      0.01    -0.03   0.01
Fairy Tales   0.05      -0.16*   -0.03      -0.02     0.06    -0.02   -0.25*
Prose         -0.07     -0.08*   0.06       0.03      -0.02   -0.02   0.04
Poetry        0.02      -0.13*   0.09       0.02      0.07    -0.04   0.04

We report our results in Table 5. First of all, not all patterns of sentiment implicitness seen in Experiment 1 are detectable as a correlation, suggesting that some of these features do not impact sentiment evocation linearly.
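Experiment 2's correlation can be sketched end-to-end: compute the absolute human/model valence difference per sentence, drop sentences shorter than five words (as for Table 5), and correlate the result with a trait score. All numbers here are invented, and Spearman's ρ is implemented with the no-ties shortcut formula where a library call (e.g. scipy.stats.spearmanr) would normally be used.

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation via the no-ties formula:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rows = [  # (human valence, model valence, trait score, sentence length in words)
    (-0.8,  0.0, 3.9, 12),
    (-0.5, -0.1, 3.2,  9),
    (-0.9, -0.2, 3.6, 15),
    (-0.4, -0.3, 2.1,  3),   # filtered out: shorter than five words
    (-0.5,  0.1, 3.8,  8),
]
kept = [(abs(h - m), t) for h, m, t, n in rows if n >= 5]
rho = spearman_rho([d for d, _ in kept], [t for _, t in kept])
```

Here higher disagreement co-occurs with higher trait scores, giving a positive ρ; the sign and magnitude are artifacts of the toy data.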
On the other hand, we do see correlations that point to interesting genre differences in how sentiment is perceived in texts. For Fiction4, we find a consistent negative correlation between human/model disagreement and arousal, which aligns with the lower levels of arousal in the implicit group observed in Experiment 1. While concreteness and interoception do not show consistent linear correlations with disagreement, effects of low interoception related to higher disagreement are evident in Andersen's fairy tales.43 Within Fiction4, the role in sentiment disagreement (which we link to evocation) of high concreteness paired with higher dominance and, to a lesser extent, lower arousal is confirmed, as is the negative correlation of disagreement with interoception. For comparison, we redid the correlations in the reference corpora detailed in Section 3.1. Here, positive correlations are also found with concreteness: the more concrete a sentence is, the more our SA model's sentiment judgment will differ from that of humans. The strongest role of concreteness in sentiment disagreement appears not in literary texts proper, but in the travel guides and letters contained in EmoBank, and in blogs. Interoception also holds a negative correlation with disagreement in the Fiction category of EmoBank, as it did with fairy tales in Fiction4. Interestingly, the negative correlation of arousal with disagreement is not as consistent in the reference corpora, where we only see a negative correlation in blogs. We find spurious positive correlations of disagreement with dominance and visual – notably, however, correlations, where they appear, tend to have the same negative or positive direction across all corpora (including Fiction4), with the exception of imageability. Facebook posts in FB seem to have no significant link to many of these channels.
43 The absence of a linear relation with concreteness is particularly interesting, given the results in Experiment 1. Concreteness appears to have an effect on the evoked sentiment for human readers, but the two elements are not systematically related – the evoked sentiment does not change linearly with an increase in concreteness.
Table 6
Mean and SD of feature scores for different text types. Since feature values show only slight differences across genres, to represent "high-water marks" we have added the mean values of image captions and free emotion-event descriptions (Section 3.1). Numbers in green represent the highest and in red the lowest values.

              Concret.     Arousal      Dominance    Imageab.        Interocept.  H/R disagr.
FB            2.70 ± 0.45  0.46 ± 0.11  0.52 ± 0.11  356.74 ± 58.47  1.43 ± 0.53  0.32 ± 0.20
EmoBank       2.65 ± 0.42  0.44 ± 0.10  0.54 ± 0.10  339.60 ± 49.89  1.10 ± 0.47  0.38 ± 0.20
Letters       2.68 ± 0.40  0.45 ± 0.08  0.57 ± 0.09  349.75 ± 52.19  1.13 ± 0.39  0.33 ± 0.23
Blog          2.61 ± 0.45  0.45 ± 0.10  0.54 ± 0.10  331.75 ± 52.54  1.12 ± 0.46  0.38 ± 0.20
Newspaper     2.62 ± 0.32  0.46 ± 0.08  0.57 ± 0.09  329.61 ± 40.49  0.93 ± 0.33  0.39 ± 0.20
Essays        2.49 ± 0.33  0.45 ± 0.09  0.55 ± 0.08  317.27 ± 41.25  0.96 ± 0.33  0.41 ± 0.19
Fiction       2.69 ± 0.47  0.43 ± 0.11  0.50 ± 0.11  349.44 ± 50.15  1.32 ± 0.55  0.39 ± 0.20
Travelguides  2.81 ± 0.43  0.43 ± 0.08  0.54 ± 0.07  349.25 ± 50.04  0.83 ± 0.27  0.39 ± 0.22
Fiction4      2.72 ± 0.46  0.43 ± 0.12  0.51 ± 0.13  353.90 ± 50.88  1.26 ± 0.53  0.35 ± 0.21
Hymns         2.58 ± 0.43  0.45 ± 0.13  0.56 ± 0.13  351.46 ± 47.78  1.39 ± 0.54  0.30 ± 0.21
Fairy tales   2.70 ± 0.37  0.42 ± 0.09  0.50 ± 0.11  349.99 ± 39.55  1.18 ± 0.47  0.34 ± 0.21
Prose         2.72 ± 0.36  0.42 ± 0.11  0.49 ± 0.10  348.36 ± 37.53  1.22 ± 0.45  0.39 ± 0.20
Poetry        2.90 ± 0.56  0.42 ± 0.14  0.47 ± 0.12  365.96 ± 69.00  1.16 ± 0.59  0.38 ± 0.19
Captions      3.12 ± 0.36  0.42 ± 0.10  0.51 ± 0.09  384.71 ± 52.78  0.81 ± 0.28  -
Emotion ev.   2.60 ± 0.30  0.46 ± 0.09  0.51 ± 0.09  349.85 ± 33.50  1.32 ± 0.34  -

5.4.
Genre differences
While most datasets seem to exploit some form of trade-off between concreteness, on one side, and arousal, dominance, and interoception on the other, relatively few show correlations with the visual and haptic semantic information, or with imageability. The exceptions are blogs in EmoBank, where the visual dimension shows a weak but positive correlation with human/model disagreement, and travel guides, where disagreement correlates with haptic and imageability. FB and the EmoBank newspaper and letter categories return non-significant correlations with most dimensions, with the exception of dominance for FB.
Genre differences in the overall use of these semantic traits can be observed in Table 6. Note that, for example, Facebook posts seem to have high values of imageability. Still, imageability in posts displays no correlation with the absolute disagreement between model and human (Table 5). In other words, it may be that although the language of posts is highly imageable, the images are not used in a way that subtly evokes human emotion and challenges models as much as in, for example, travel guides. Similarly, literary genres (in EmoBank and Fiction4) seem to have high values for imageability and visual scores (Table 6), but these dimensions exhibit little correlation with human/model disagreement.
Figure 3: Correlation between concreteness and imageability, haptic and visual. Note Spearman's ρ at the top of each plot.
5.5. Relation between features
Throughout our datasets, concreteness has a positive relation with disagreement – showing higher levels where models are unable to capture the sentiment that humans perceive – as much as interoception has a negative one.
In general, the opposite behavior of concreteness and interoception across all of our datasets appears to confirm our intuition that interoception works as a sort of anti-concreteness when it comes to the evocation of sentiments: the use of external objects and "things" to evoke sentiments in the reader makes little recourse to the interoceptive dimension. The fact that visual, haptic, and imageability correlations, where relevant, tend in the same direction as concreteness also supports this hypothesis.
It is intriguing that a positive correlation between human/model disagreement and concreteness tends to co-occur with a positive correlation with sensory norms. The intuition that the concrete is something "that is perceived through the senses" or "that can be drawn" would have led us to expect correlations of, e.g., haptic and concreteness to co-occur. On the other hand, concreteness does not have to occur with explicit sensory information at all: many words like house, sea, or wood do not peak on one specific sense and yet are considered fairly concrete, while some words like melody or rhythm might be less concrete and yet have sensory associations. The kind of concreteness that matters here might be related more to a general physical materiality than to a specific sensory load.
Concreteness exhibits a stronger correlation with the visual, haptic, interoceptive, and imageability traits (ρ > .5, p < .01) than with dominance and arousal (around ρ = .2, p < .01). When correlating terms in the concreteness dictionary with the other semantic traits, we find that especially interoception and concreteness show an interesting relation. Words referring to internal emotional states, presumably also having higher arousal (e.g., "lovesickness", "hopelessness"), tend to cluster in the direction of high interoception and low concreteness.
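The lexicon-level comparison described here amounts to joining two word-norm sets on their shared vocabulary and correlating the paired ratings. A minimal sketch, with hypothetical mini-lexica standing in for the Brysbaert and Lancaster norms and a no-ties Spearman implementation in place of a library call:

```python
def spearman_rho(x: list[float], y: list[float]) -> float:
    """Spearman's rank correlation via the no-ties shortcut formula."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    return 1 - 6 * sum((a - b) ** 2 for a, b in zip(rx, ry)) / (n * (n * n - 1))

# Invented ratings, loosely mimicking the patterns discussed in the text:
concreteness = {"jewel": 4.8, "breath": 4.2, "hopelessness": 1.6,
                "melody": 2.9, "sea": 4.6}
interoception = {"jewel": 0.4, "breath": 3.9, "hopelessness": 3.5,
                 "melody": 1.2, "rhythm": 2.0}

shared = sorted(concreteness.keys() & interoception.keys())  # words in both lexica
rho = spearman_rho([concreteness[w] for w in shared],
                   [interoception[w] for w in shared])
```

On the real lexica this join covers tens of thousands of words; the toy overlap here yields a negative ρ by construction.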
While concreteness has a robust negative correlation with interoception (ρ = -.52, p < .01), i.e., high-concreteness words are generally less interoceptive, we find that there is a set of words which maintain high interoception and high concreteness (upper right in Fig. 4). These words may be characterized as referring to concrete objects which are nevertheless associated with internal (vs. external) states and experiences (e.g., "bladder", "breath"). Conversely, words in the lower right corner, with high concreteness and low interoceptive values, appear rather to be objects of external experience (e.g., "jewel", "clip", "lightswitch"), less associated with internal sensation (than is, e.g., "bee sting" or "drugs").
Figure 4: Overlap between the interoceptive and concreteness lexica, showing the values of words in both lexica. A random set of words is visualized at each pole.
Considering the opposite correlation of human/model disagreement with interoception vs. concreteness, we hypothesize that words used in instances of the "objective correlative" would predominantly appear in the lower right corner of Fig. 4. This means that they are associated with words that are not only more concrete but also "objective" in the sense of referring to external rather than subjective or internal experiences. The "objective correlative" might therefore be understood as "objective" in both senses: impersonal and focused on external objects.
6. Discussion and conclusion
We have examined the relation between human/model disagreement on sentiment annotations and a chosen set of semantic traits for a new corpus of literary prose, comparing them with datasets representing several other domains, and we have extended the set of semantic traits used in previous literature to include the sensory and interoceptive dimensions.
Overall, we confirm previous results obtained on smaller data about the relation of semantic traits to the presence of "undetected" or implicit sentiment in literary fiction, and we have observed similar trends in non-fictional domains, with interesting differences between genres. The "undetected" sentiments are likely to be evoked, rather than stated, and this evocation seems to pass through a trade-off between several semantic traits: an increase in concreteness and a decrease in arousal and interoception.44 These traits seem to align with what literary theory has called the "objective correlative": the strategy of conveying sentiment (or emotion) through reference to external, material, or "objective" reality. This seems to happen together with the downplaying of semantic dimensions related to intensity and control, contributing to a subtler, less explicit form of emotional communication, which we might characterize as an omissive evocative strategy.
Figure 5: Standardized semantic trait values (scaled to the range -1 to 1) in an example sentence. Dots connect two or more successive values (values are 'NaN' if the lemma was not in the respective feature lexicon). The mean concreteness for this sentence = 0.4, and the mean interoception = 0.25.
An example of the trade-off between omission and use of the objective correlative can be observed in the following sentence by Hemingway from Fiction4: "Ay, he said aloud. There is no translation for this word and perhaps it is just a noise such as a man might make, involuntarily, feeling the nail go through his hands and into the wood". The sentence was consistently rated as negative by humans, and neutral by the model (see Fig. 5 above). Note how concreteness and interoception tend to diverge in this sentence (e.g., on "feeling" or "nail"), while arousal and dominance values are sparse (Fig. 5).
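A profile of the kind shown in Figure 5 can be reproduced schematically: min-max scale each trait lexicon to [-1, 1], then look up each token, leaving None ('NaN' in the figure) for lemmas missing from the lexicon. This is a sketch under assumed ratings, not the paper's actual feature values.

```python
def scale_lexicon(lexicon: dict[str, float]) -> dict[str, float]:
    """Min-max scale lexicon ratings to the range [-1, 1]."""
    lo, hi = min(lexicon.values()), max(lexicon.values())
    return {w: 2 * (v - lo) / (hi - lo) - 1 for w, v in lexicon.items()}

# Hypothetical concreteness ratings for tokens of the Hemingway example:
concreteness = scale_lexicon({"nail": 4.9, "wood": 4.8, "feeling": 1.8, "noise": 3.1})
tokens = ["feeling", "the", "nail", "wood"]
profile = [concreteness.get(t) for t in tokens]  # "the" is out of vocabulary -> None
```

Repeating the lookup with each scaled trait lexicon yields one line per trait, which is the structure plotted in Figure 5.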
In the future, we intend to expand our analysis to larger and more diverse corpora, and to integrate more psycholinguistic resources, seeking ultimately to contribute to the development of better tools for sentiment analysis in literary genres. We would also like to examine the relation between reader response or literary reception and the concreteness-dominance or concreteness-interoception trade-off.
Limitations
We want to underline that our corpus of fiction (Fiction4) is limited, with only one author representing three of the four categories (Plath for poetry, Hemingway for prose, and Andersen for fairy tales). Moreover, the demographic range of our dataset is narrow (in terms of gender, ethnicity, age, social class, etc.). Replication of these results on a larger and more diverse corpus of fiction is needed, and our results should be interpreted with this in mind.
44 Perhaps surprisingly, the effect of concreteness did not have a strong link with existing norms of sensory information, but only with interoception, with the partial exception of literary prose.
Online Resources
See https://github.com/centre-for-humanities-computing/literary_evocation for code and data.
Acknowledgments
We want to thank everyone who contributed to this work, especially Mia Jacobsen, as well as colleagues and friends for pointing out pitfalls and sharing ideas.
References
[1] S. Ahmed. The cultural politics of emotion. Edinburgh Univ. Press, 2010.
[2] H. J. Alantari, I. S. Currim, Y. Deng, and S. Singh. "An empirical comparison of machine learning methods for text-based sentiment analysis of online consumer reviews". In: International Journal of Research in Marketing 39.1 (2022), pp. 1–19. doi: 10.1016/j.ijresmar.2021.10.011.
[3] A. Allaith, K. Degn, A. Conroy, B. Pedersen, J. Bjerring-Hansen, and D. Hershcovich. "Sentiment Classification of Historical Danish and Norwegian Literary Texts". In: Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa).
Ed. by T. Alumäe and M. Fishel. Tórshavn, Faroe Islands: University of Tartu Library, 2023, pp. 324–334. url: https://aclanthology.org/2023.nodalida-1.34.
[4] C. O. Alm and R. Sproat. "Emotional Sequencing and Development in Fairy Tales". In: Affective Computing and Intelligent Interaction. Ed. by J. Tao, T. Tan, and R. W. Picard. Berlin, Heidelberg: Springer, 2005, pp. 668–674. doi: 10.1007/11573548_86.
[5] D. Attridge. Peculiar Language. Routledge, 1988.
[6] J. Auracher and H. Bosch. "Showing with words: The influence of language concreteness on suspense". In: Scientific Study of Literature 6.2 (2016), pp. 208–242. doi: 10.1075/ssol.6.2.03aur.
[7] F. Barbieri, L. E. Anke, and J. Camacho-Collados. XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond. 2022. doi: 10.48550/arXiv.2104.12250.
[8] V. Batanović, M. Cvetanović, and B. Nikolić. "A versatile framework for resource-limited sentiment articulation, annotation, and analysis of short texts". In: PLoS ONE 15.11 (2020). doi: 10.1371/journal.pone.0242050. url: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660500/.
[9] K. F. Baunvig. "Forestillede fællesskabers virtuelle sangritualer: Forskningsprojekt vil kaste lys over den kulturelle betydning af den virtuelle fællessang under corona-tiden". In: Tidsskriftet SANG 1.1 (2020), pp. 40–45. doi: 10.7146/sang.v1i1.137029.
[10] Y. Bizzoni and P. Feldkamp. "Below the Sea (with the Sharks): Probing Textual Features of Implicit Sentiment in a Literary Case-study". In: Proceedings of the Third Workshop on Understanding Implicit and Underspecified Language. Ed. by V. Pyatkin, D. Fried, E. Stengel-Eskin, A. Liu, and S. Pezzelle. Malta: Association for Computational Linguistics, 2024, pp. 54–61. url: https://aclanthology.org/2024.unimplicit-1.5.
[11] Y. Bizzoni and P. Feldkamp. "Comparing Transformer and Dictionary-based Sentiment Models for Literary Texts: Hemingway as a Case-study".
In: Proceedings of the 3rd International Workshop on Natural Language Processing for Digital Humanities. Tokyo, Japan: Association for Computational Linguistics, 2023, pp. 219–226. url: https://rootroo.com/downloads/nlp4dh_iwclul_proceedings.pdf.
[12] W. C. Booth. The Rhetoric of Fiction. 2nd edition. Chicago: University of Chicago Press, 1983.
[13] E. Borelli, D. Crepaldi, C. A. Porro, and C. Cacciari. "The psycholinguistic and affective structure of words conveying pain". In: PLoS ONE 13.6 (2018), e0199658.
[14] K. Bowers and Q. Dombrowski. Katia and the Sentiment Snobs. 2021. url: https://datasittersclub.github.io/site/dsc11.html.
[15] C. Britzolakis. "Ariel and other poems". In: The Cambridge Companion to Sylvia Plath. Ed. by J. Gill. Cambridge University Press, 2006, pp. 107–123. doi: 10.1017/ccol0521844967.
[16] C. Brooks. The well wrought urn: studies in the structure of poetry. Harcourt, 1947.
[17] B. Brown. "Thing Theory". In: Critical Inquiry 28.1 (2001), pp. 1–22. url: http://www.jstor.org/stable/1344258.
[18] M. Brysbaert, A. B. Warriner, and V. Kuperman. "Concreteness ratings for 40 thousand generally known English word lemmas". In: Behavior Research Methods 46.3 (2014), pp. 904–911. doi: 10.3758/s13428-013-0403-5.
[19] S. Buechel and U. Hahn. "EmoBank: Studying the Impact of Annotation Perspective and Representation Format on Dimensional Emotion Analysis". In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Ed. by M. Lapata, P. Blunsom, and A. Koller. Valencia, Spain: Association for Computational Linguistics, 2017, pp. 578–585. url: https://aclanthology.org/E17-2092.
[20] J. Burroway. Writing Fiction: A Guide to Narrative Craft. Little, Brown, 1987.
[21] C. CCLM. Danske børn og unge har stort kendskab til H.C. Andersen. 2003. url: https://dpu.au.dk/om-dpu/nyheder/nyhed/artikel/danske-boern-og-unge-har-stort-kendskab-til-hc-andersen.
[22] J.
Charbonnier and C. Wartena. "Predicting Word Concreteness and Imagery". In: Proceedings of the 13th International Conference on Computational Semantics - Long Papers. Ed. by S. Dobnik, S. Chatzikyriakidis, and V. Demberg. Gothenburg, Sweden: Association for Computational Linguistics, 2019, pp. 176–187. doi: 10.18653/v1/W19-0415.
[23] M. Coltheart. "The MRC Psycholinguistic Database". In: The Quarterly Journal of Experimental Psychology Section A 33.4 (1981), pp. 497–505. doi: 10.1080/14640748108400805.
[24] M. Daoshan and Z. Shuo. "A Discourse Study of the Iceberg Principle in A Farewell to Arms". In: Studies in Literature and Language 8.1 (2014), pp. 80–84.
[25] A. Doche and A. S. Ross. "'Here is my shameful confession. I don't really "get" poetry': discerning reader types in responses to Sylvia Plath's Ariel on Goodreads". In: Textual Practice 37.6 (2023), pp. 976–996. doi: 10.1080/0950236x.2022.2082516.
[26] T. Eliot. Selected Essays by T. S. Eliot. Faber & Faber, 1948.
[27] K. Elkins. The Shapes of Stories: Sentiment Analysis for Narrative. Cambridge University Press, 2022. doi: 10.1017/9781009270403.
[28] H. Elsahar and M. Gallé. "To Annotate or Not? Predicting Performance Drop under Domain Shift". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, 2019, pp. 2163–2173. doi: 10.18653/v1/D19-1222. url: https://aclanthology.org/D19-1222.
[29] M. Fludernik. "Towards a 'Natural' Narratology". In: Jlse 25.2 (1996), pp. 97–141. doi: 10.1515/jlse.1996.25.2.97.
[30] A. Gargett, J. Ruppenhofer, and J. Barnden. "Dimensions of Metaphorical Meaning". In: Proceedings of the 4th Workshop on Cognitive Aspects of the Lexicon (CogALex). Ed. by M. Zock, R. Rapp, and C.-R. Huang. Dublin, Ireland: Association for Computational Linguistics and Dublin City University, 2014, pp.
166–173. doi: 10.3115/v1/W14-4721.
[31] C. P. Heaton. "Style in The Old Man and the Sea". In: Style 4.1 (1970), pp. 11–27. url: https://www.jstor.org/stable/42945039.
[32] C. Hutto and E. Gilbert. "VADER: A parsimonious rule-based model for sentiment analysis of social media text". In: Proceedings of the International AAAI Conference on Web and Social Media. Vol. 8. 2014, pp. 216–225. doi: 10.1609/icwsm.v8i1.14550.
[33] N. Ide, C. Baker, C. Fellbaum, and R. Passonneau. "The Manually Annotated Sub-Corpus: A Community Resource for and by the People". In: Proceedings of the ACL 2010 Conference Short Papers. Ed. by J. Hajič, S. Carberry, S. Clark, and J. Nivre. Uppsala, Sweden: Association for Computational Linguistics, 2010, pp. 68–73. url: https://aclanthology.org/P10-2013.
[34] R. Jakobson. "Linguistics and Poetics". In: Linguistics and Poetics. De Gruyter Mouton, 2010 (1981), pp. 18–51. doi: 10.1515/9783110802122.18.
[35] M. Jockers. A Novel Method for Detecting Plot. 2014. url: https://www.matthewjockers.net/2014/06/05/a-novel-method-for-detecting-plot/.
[36] J. T. Kao and D. Jurafsky. "A computational analysis of poetic style: Imagism and its influence on modern professional and amateur poetry". In: Linguistic Issues in Language Technology 12 (2015). url: https://aclanthology.org/2015.lilt-12.3.
[37] C. Kennington. "Enriching Language Models with Visually-grounded Word Vectors and the Lancaster Sensorimotor Norms". In: Proceedings of the 25th Conference on Computational Natural Language Learning. Ed. by A. Bisazza and O. Abend. Online: Association for Computational Linguistics, 2021, pp. 148–157. doi: 10.18653/v1/2021.conll-1.11.
[38] E. Kim and R. Klinger. "A Survey on Sentiment and Emotion Analysis for Computational Literary Studies". In: Zeitschrift für digitale Geisteswissenschaften (2019). doi: 10.17175/2019_008. url: http://arxiv.org/abs/1808.03137.
[39] S.-T. Kousta, G. Vigliocco, D. P. Vinson, M. Andrews, and E. Del Campo.
"The representation of abstract words: Why emotion matters." In: Journal of Experimental Psychology: General 140.1 (2011), pp. 14–34. doi: 10.1037/a0021446.
[40] Z. Li, Y. Zou, C. Zhang, Q. Zhang, and Z. Wei. "Learning Implicit Sentiment in Aspect-based Sentiment Analysis with Supervised Contrastive Pre-Training". In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Ed. by M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 246–256. doi: 10.18653/v1/2021.emnlp-main.22. url: https://aclanthology.org/2021.emnlp-main.22.
[41] T. Lundskær-Nielsen. "The Language of Hans Christian Andersen's Fairy Tales – Compared with Earlier Tales". In: Scandinavistica Vilnensis 1.9 (2014), pp. 97–112. doi: 10.15388/ScandinavisticaVilnensis.2014.9.8. url: https://www.journals.vu.lt/scandinavistica/article/view/14002.
[42] D. Lynott, L. Connell, M. Brysbaert, J. Brand, and J. Carney. "The Lancaster Sensorimotor Norms: multidimensional measures of perceptual and action strength for 40,000 English words". In: Behavior Research Methods 52.3 (2020), pp. 1271–1291. doi: 10.3758/s13428-019-01316-z.
[43] M. M. Maslej, R. A. Mar, and V. Kuperman. "The textual features of fiction that appeal to readers: Emotion and abstractness." In: Psychology of Aesthetics, Creativity, and the Arts 15.2 (2021), pp. 272–283. doi: 10.1037/aca0000282.
[44] J. Meyers, ed. Hemingway: The Critical Heritage. Routledge, 1982.
[45] S. Mohammad. "Obtaining Reliable Human Ratings of Valence, Arousal, and Dominance for 20,000 English Words". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 174–184. doi: 10.18653/v1/P18-1017.
[46] C. Molesworth. "'With Your Own Face On': The Origins and Consequences of Confessional Poetry".
In: Twentieth Century Literature 22.2 (1976), pp. 163–178. doi: 10.2307/440682.
[47] J. Mukařovský. "Standard language and Poetic Language". In: A Prague School Reader on Esthetics, Literary Structure, and Style. Ed. by P. L. Garvin. 1932. Georgetown University Press, 1964, pp. 17–30.
[48] M. A. Nielsen. "Salmesprog". In: Dansk Sproghistorie Bind 4. Sprog i brug. Aarhus University Press and Society for Danish Language and Literature (DSLDK), 2020.
[49] B. Ohana, S. J. Delany, and B. Tierney. "A Case-Based Approach to Cross Domain Sentiment Classification". In: Case-Based Reasoning Research and Development. Ed. by B. D. Agudo and I. Watson. Lecture Notes in Computer Science. Berlin, Heidelberg: Springer, 2012, pp. 284–296. doi: 10.1007/978-3-642-32986-9_22.
[50] C. E. Osgood and G. J. Suci. "Factor analysis of meaning." In: Journal of Experimental Psychology 50.5 (1955), p. 325.
[51] L. Oulanne. "Lived Things: Materialities of Agency, Affect, and Meaning in the Short Fiction of Djuna Barnes and Jean Rhys". PhD thesis. Helsinki: University of Helsinki, 2018. url: http://ethesis.helsinki.fi.
[52] E. Pound. "A Few Don'ts by an Imagiste". In: Poetry 1.6 (1913), pp. 200–206. url: https://www.jstor.org/stable/20569730.
[53] D. Preoţiuc-Pietro, H. A. Schwartz, G. Park, J. Eichstaedt, M. Kern, L. Ungar, and E. Shulman. "Modelling Valence and Arousal in Facebook posts". In: Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. Ed. by A. Balahur, E. van der Goot, P. Vossen, and A. Montoyo. San Diego, California: Association for Computational Linguistics, 2016, pp. 9–15. doi: 10.18653/v1/W16-0404. url: https://aclanthology.org/W16-0404.
[54] A. J. Reagan, L. Mitchell, D. Kiley, C. M. Danforth, and P. S. Dodds. "The Emotional Arcs of Stories Are Dominated by Six Basic Shapes". In: EPJ Data Science 5.1 (2016), pp. 1–12. doi: 10.1140/epjds/s13688-016-0093-1.
url: https://epjdatascience.springeropen.com/articles/10.1140/epjds/s13688-016-0093-1.
[55] S. Rebora. "Sentiment Analysis in Literary Studies. A Critical Survey". In: Digital Humanities Quarterly 17.2 (2023). url: https://www.digitalhumanities.org/dhq/vol/17/2/000691/000691.html#kim-klinger2018b.
[56] S. Rebora, M. Lehmann, A. Heumann, W. Ding, and G. Lauer. "Comparing ChatGPT to Human Raters and Sentiment Analysis Tools for German Children's Literature". In: Proceedings of the Computational Humanities Research Conference 2023, Paris, France, December 6-8, 2023. Ed. by A. Sela, F. Jannidis, and I. Romanowska. Vol. 3558. CEUR Workshop Proceedings. CEUR-WS.org, 2023, pp. 333–343. url: https://ceur-ws.org/Vol-3558/paper3340.pdf.
[57] V. Rentoumi, G. Giannakopoulos, V. Karkaletsis, and G. A. Vouros. "Sentiment Analysis of Figurative Language using a Word Sense Disambiguation Approach". In: Proceedings of the International Conference RANLP-2009. Ed. by G. Angelova and R. Mitkov. Borovets, Bulgaria: Association for Computational Linguistics, 2009, pp. 370–375. url: https://aclanthology.org/R09-1067.
[58] I. A. Richards. Principles of Literary Criticism. Routledge, 2003.
[59] D. Ringgaard and M. R. Thomsen, eds. Danish literature as world literature. Literatures as world literature. New York: Bloomsbury Academic, 2017.
[60] L. M. Rosenblatt. "The Literary Transaction: Evocation and Response". In: Theory Into Practice 21.4 (1982), pp. 268–277. url: https://www.jstor.org/stable/1476352.
[61] M. Sandri, E. Leonardelli, S. Tonelli, and E. Jezek. "Why Don't You Do It Right? Analysing Annotators' Disagreement in Subjective Tasks". In: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics. Ed. by A. Vlachos and I. Augenstein. Dubrovnik, Croatia: Association for Computational Linguistics, 2023, pp. 2428–2441. doi: 10.18653/v1/2023.eacl-main.178.
[62] B. Sandstrøm. "Salmen - fra kampsang til lovprisning".
In: Dansk Litteraturs Historie 1100-1800. Ed. by V. A. Pedersen, M. Schack, and K. P. Mortensen. Gyldendal, 2007. [63] E. Savinova and F. Moscoso Del Prado. “Analyzing Subjectivity Using a Transformer-Based Regressor Trained on Naïve Speakers’ Judgements”. In: Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis. Ed. by J. Barnes, O. De Clercq, and R. Klinger. Toronto, Canada: Association for Computational Linguistics, 2023, pp. 305–314. doi: 10.18653/v1/2023.wassa-1.27. [64] T. Schmidt and M. Burghardt. “An Evaluation of Lexicon-based Sentiment Analysis Techniques for the Plays of Gotthold Ephraim Lessing”. In: Proceedings of the Second Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature. Ed. by B. Alex, S. Degaetano-Ortlieb, A. Feldman, A. Kazantseva, N. Reiter, and S. Szpakowicz. Santa Fe, New Mexico: Association for Computational Linguistics, 2018, pp. 139–149. url: https://aclanthology.org/W18-4516. [65] T. Schmidt, K. Dennerlein, and C. Wolff. “Using Deep Learning for Emotion Analysis of 18th and 19th Century German Plays”. In: Fabrikation von Erkenntnis: Experimente in den Digital Humanities (2021). doi: 10.26298/melusina.8f8w-y749-udlf. [66] P. Sharma, N. Ding, S. Goodman, and R. Soricut. “Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Ed. by I. Gurevych and Y. Miyao. Melbourne, Australia: Association for Computational Linguistics, 2018, pp. 2556–2565. doi: 10.18653/v1/P18-1238. url: https://aclanthology.org/P18-1238. [67] E. Stengel-Eskin, J. Guallar-Blasco, and B. Van Durme. “Human-Model Divergence in the Handling of Vagueness”. In: Proceedings of the 1st Workshop on Understanding Implicit and Underspecified Language. Ed. by M. Roth, R.
Tsarfaty, and Y. Goldberg. Online: Association for Computational Linguistics, 2021, pp. 43–57. doi: 10.18653/v1/2021.unimplicit-1.6. [68] P. J. Stone, R. F. Bales, J. Z. Namenwirth, and D. M. Ogilvie. “The General Inquirer: A computer system for content analysis and retrieval based on the sentence as a unit of information”. In: Behavioral Science 7.4 (1962), p. 484. [69] T. Strychacz. “‘The sort of thing you should not admit’: Ernest Hemingway’s Aesthetic of Emotional Restraint”. In: Boys Don’t Cry? Rethinking Narratives of Masculinity and Emotion in the U.S. Ed. by M. Shamir and J. Travis. Columbia University Press, 2002, pp. 141–166. doi: 10.7312/sham12034-009. [70] S. Ullrich, A. Aryani, M. Kraxenberger, A. M. Jacobs, and M. Conrad. “On the relation between the general affective meaning and the basic sublexical, lexical, and inter-lexical features of poetic texts – a case study using 57 poems of H. M. Enzensberger”. In: Frontiers in Psychology 7 (2017), p. 2073. [71] H. G. Wallbott and K. R. Scherer. “How universal and specific is emotional experience? Evidence from 27 countries on five continents”. In: Social Science Information 25.4 (1986), pp. 763–795. doi: 10.1177/053901886025004001. [72] M. Wan, Q. Su, K. Ahrens, and C.-R. Huang. “Perceptional and actional enrichment for metaphor detection with sensorimotor norms”. In: Natural Language Engineering (2023), pp. 1–29. doi: 10.1017/s135132492300044x. url: https://www.cambridge.org/core/journals/natural-language-engineering/article/perceptional-and-actional-enrichment-for-metaphor-detection-with-sensorimotor-norms/0BA36E2578B2AD80CCCE00E6AF6969AB. [73] A. B. Warriner, V. Kuperman, and M. Brysbaert. “Norms of valence, arousal, and dominance for 13,915 English lemmas”. In: Behavior Research Methods 45 (2013), pp. 1191–1207. [74] T. Wilson, J. Wiebe, and P. Hoffmann. “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis”.
In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing. Ed. by R. Mooney, C. Brew, L.-F. Chien, and K. Kirchhoff. Vancouver, British Columbia, Canada: Association for Computational Linguistics, 2005, pp. 347–354. url: https://aclanthology.org/H05-1044. [75] D. Zhou, J. Wang, L. Zhang, and Y. He. “Implicit Sentiment Analysis with Event-centered Text Representation”. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Ed. by M.-F. Moens, X. Huang, L. Specia, and S. W.-t. Yih. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, 2021, pp. 6884–6893. doi: 10.18653/v1/2021.emnlp-main.551.

Table 7
Overview of texts within the four genres of the Fiction4 corpus: the total number of lines – verses (in the case of hymns and poetry) or sentences (in the case of fairy tales and prose) – the number of words per dataset, the mean number of words per line (verse/sentence), the period of publication, and the number of human annotators. Note that the number of lines also equals the number of annotations, as annotation was done on a verse/sentence basis.

                Texts   Lines   Words    x̄ Words/Line   Period      Annotators
Hymns             65    2,026   12,798        6.3        1798-1873        2
Fairy tales        3      771   18,597       24.1        1837-1847        3
Prose              1    1,900   30,279       15.7        1952             2
Poetry            40    1,545   11,576        7.3        1965             3
Full Fiction4    109    6,300   73,250       11.6        1837-1965       >2

Figure 6: Distribution of human and RoBERTa scores for the Fiction4 corpus.

Table 8
Effect sizes of the Mann-Whitney U test on the implicit vs. explicit groups of sentences for each genre of the Fiction4 corpus, with the R2 of a linear regression (seeking to model the two groups) for reference. While the R2 coefficients are low, the point is to show that a difference exists between the groups in terms of feature values – which nevertheless show large overlaps visually – not to model valence as such via these features.
We also include the test between groups within each genre of Fiction4 separately, and the test for the FB corpus for comparison. For readability, the effect sizes have been divided by 100 here. Note that the group sizes of the implicit/explicit sentences differ across genres, with the smallest groups compared being those of the fairy tales (𝑖 = 147 / 𝑒 = 317). Values in black: 𝑝 < .05; with *: 𝑝 < .01.

Measure  Corpus       Concret.  Arousal  Dominance  Imag.  Visual  Haptic  Interocept.
MW       Fiction4      1,278*   1,807*    1,636*    1,538  1,413*  1,461*    1,975*
MW       Hymns           151*     228*      221*      200    163*    183       254*
MW       Fairy tales      20*      24        23        21     22      22        26*
MW       Prose           121*     170*      138       139    129*    133*      174*
MW       Poetry           60*      73*       59        63     68      67        86*
R2       Fiction4       0.02*    0.03*     0.01*        0   0.01*   0.01*     0.05*

(a) Difference between the implicit/explicit groups in arousal, dominance, and human-annotated arousal. We add the latter for reference since it is available in the EmoBank corpus. Note that harousal and arousal behave similarly.
(b) Difference between the implicit/explicit groups in concreteness, imageability, visual, haptic, and interoceptive levels.
Figure 7: Feature levels in the implicit and explicit groups of the EmoBank corpus.
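The group comparison behind Table 8 can be sketched in a few lines of code. The following is a minimal, illustrative implementation using only the Python standard library – not the paper's own pipeline: it computes the Mann-Whitney U statistic for two groups of sentence-level feature values via the normal approximation (without tie correction), and returns the rank-biserial correlation as a bounded effect size. The function names are ours; the table above reports raw U statistics scaled by 100 rather than the rank-biserial value.

```python
import math

def _ranks(values):
    """Average (1-based) ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mannwhitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie
    correction), plus the rank-biserial correlation as an effect size."""
    n1, n2 = len(x), len(y)
    ranks = _ranks(list(x) + list(y))
    r1 = sum(ranks[:n1])                  # rank sum of group x
    u1 = r1 - n1 * (n1 + 1) / 2           # U statistic for group x
    mu = n1 * n2 / 2                      # mean of U under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    rank_biserial = 1 - 2 * u1 / (n1 * n2)  # in [-1, 1]; 0 = full overlap
    return u1, p, rank_biserial
```

Feeding, say, the concreteness scores of the implicit and explicit sentence groups to `mannwhitney_u` would yield a U statistic and significance level analogous to the Concret. column of Table 8; the rank-biserial value additionally normalizes for the differing group sizes noted in the caption.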