HarryMotions – Classifying Relationships in Harry Potter based on Emotion Analysis

Albin Zehe, Julia Arns, Lena Hettinger, Andreas Hotho
Data Science Chair, University of Würzburg
[zehe,arns,hettinger,hotho]@informatik.uni-wuerzburg.de

Abstract

Sentiment Analysis has long been a topic of interest in natural language processing and computational literary studies, where it can be used to infer the relationships between fictional characters. Building on the dataset and results of Kim and Klinger (2019), we propose a classifier based on BERT that improves the results reported therein and show that we can use this classifier to determine the relation between characters in Harry Potter novels. Our proposed sentiment classifier yields an F1-score of up to 75 % for binary classification of emotions. Aggregating these emotions over novels, we reach an F1-score of up to 68 % for the classification of a pair of characters as friendly or unfriendly.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Characters and their relations are one of the basic building blocks of stories (Hettinger et al., 2015). Detecting them automatically is therefore a highly interesting task for the analysis of fictional texts. While there exists a multitude of methods for the extraction of character networks (Labatut and Bost, 2019), these often produce networks with unlabelled edges, that is, without information about the kind of relationship the characters share. Following Kim and Klinger (2019), we work towards the goal of detecting the polarity of relations using sentiment analysis. To this end, we collect all chunks of text in a novel mentioning a pair of characters and perform sentiment analysis on these pieces of text. While methods for sentiment analysis perform very well in certain domains, mostly on short texts like tweets, product reviews or news articles, the task still poses a significant challenge in other domains. Fictional literary texts in particular are hard to analyse, since they usually do not express emotions explicitly; instead, the emotions have to be inferred from context and possibly world knowledge.

Recently, the trend in NLP has been to use large transformer models that have been pre-trained for language modelling (or similar tasks not requiring explicit annotations) on enormous datasets. We follow this trend by fine-tuning BERT (Devlin et al., 2019) for the task of classifying emotions in interactions between characters. We use BookNLP (Bamman et al., 2014) to extract entity mentions and co-references and then fine-tune BERT on the emotion dataset provided by Kim and Klinger (2019). Emotions are aggregated to detect overall relations between characters and their development over a novel, as exemplified in Figure 1 (cf. Section 4).

Our contribution is two-fold: 1. We generally improve results on the emotion classification tasks from Kim and Klinger (2019). 2. We track the emotional relations detected by our classifier over the course of a novel and describe a simple method to aggregate them into an overall label. We evaluate this method on the text of the well-known Harry Potter series (Rowling, 1997).

The remainder of this paper is structured as follows: After this short introduction, we present related work. In Section 3, we describe our approaches to emotion and relation classification as well as our results. We conclude the paper with a discussion of our results and some possible directions for future work.

2 Related Work

Our work is situated at the intersection of sentiment analysis and social network extraction.

Character networks for works of fiction have

Figure 1: Trajectory of emotions for different character pairs in Harry Potter as detected by our system. The points where Harry and Ron/Hermione become friends are clearly visible. Details are discussed in Section 4.
The x axis corresponds to chapters in the books, with book 3 having more chapters than book 1 and thus a longer trajectory.

been studied extensively in recent years (Labatut and Bost, 2019). Some work has been done on extracting networks from textual summaries (Chaturvedi et al., 2016; Srivastava et al., 2016) and on training large neural networks to specifically model relationships over time (Iyyer et al., 2016). While the Harry Potter novels have been explored before (Vilares and Gómez-Rodríguez, 2019; Everton et al., 2019), research has not yet concentrated on emotional relations between characters.

For sentiment analysis, most work has focused on short, self-contained texts like tweets (Islam et al., 2019; Rosenthal et al., 2017) or reviews (Maas et al., 2011; Xue et al., 2020; Socher et al., 2013). Sentiment analysis in fictional texts has become a topic of interest, but has so far proven difficult because of the lack of suitable datasets. Kim and Klinger (2018) provide an extensive overview of papers addressing sentiment analysis in fictional texts, including papers that use emotions in the context of social network extraction. However, most of these works employ rather simple sentiment analysis methods (e.g., Zehe et al. (2016) rely on a simple lookup in a sentiment lexicon). Most similar to our work is Kim and Klinger (2019), which we directly build upon. The authors propose a new corpus of short pieces of text annotated with the emotional relations between the characters described in these texts. They train a GRU (Cho et al., 2014) neural network to predict the emotions based on this corpus, showing promising results with F1-scores of up to 67 % for undirected binary classification (positive and negative emotions) and 46 % for 5 basic emotions in the story-level evaluation as described below. We extend this work by improving the sentiment analysis model and by aggregating the instance-level labels for full novels.

3 Classifying Emotional Relations

We address two tasks in this paper: mention-level emotion classification and story-level relation classification, which we see as two steps in a pipeline.

Emotion Classification: Following Kim and Klinger (2019), we define emotion classification as learning a classifier that, given a short piece of text (roughly one sentence) containing two characters, predicts the emotion described therein. We perform this task at different granularity levels, using either 2, 5 or 8 directed or undirected emotions.

Relation Classification: We define relation classification as an aggregation of the emotions discovered in step 1 over a novel. In this paper, we distinguish between "friendly" and "unfriendly" relations.

3.1 Method

Emotion Classification: We use a pretrained BERT model (Devlin et al., 2019), which we fine-tune to our task using the fast-bert library¹, mostly keeping the default parameters. We train for 6 (2- and 5-class) or 12 (8-class) epochs with batch size 1.

¹ https://github.com/kaushaltrivedi/fast-bert, based on https://github.com/huggingface/transformers

Relation Classification: We extract all interactions from a novel mentioning a pair of characters a, b, classify the emotions described therein and aggregate them to an overall label. We use BookNLP (Bamman et al., 2014) to perform co-reference resolution and extract all interactions where both a and b each appear at least 20 times in the novel. We define an interaction as a chunk of text where a and b appear with no more than 10 tokens between them, regardless of sentence boundaries, with 10 additional tokens on both sides as context. We select only pairs where at least 5 interactions occur in

Novel  #friendly  #unfriendly  #disagree
HP1       64          30          2/0
HP2       61          29          3/0
HP3       62          26          7/4
HP4      233          36         22/0
HP5      144          57         19/0
HP6      107          38         18/0
HP7      115          44         27/0

Table 1: Character relations in Harry Potter.
The middle columns show friendly and unfriendly relations, respectively. The last column shows relations where a tie-breaker was used / no agreement could be reached.

the novel and classify the emotions in each of these interactions using our BERT-based classifier. For the aggregation of emotions to an overall relation, we count the number of positive, negative, neutral and overall emotions (X_{a,b}, with X ∈ {pos, neg, neu, all}) between a and b, compare the fraction of positive emotions to a threshold α, and classify relations as

    rel(a, b) = friendly,    if α < pos_{a,b} / all_{a,b}
                unfriendly,  if α ≥ pos_{a,b} / all_{a,b}.

The amount α of positive emotions required for a friendly relationship is a hyper-parameter.

3.2 Datasets

Emotion Classification: For the first task, we use the dataset provided by Kim and Klinger (2019) and refer to their paper for a detailed description due to space constraints. The dataset consists of 1335 samples², each annotated according to multiple schemes. These schemes differ in the number of emotions that are annotated (two, five or eight) and in whether the emotions are directed (from a causing to an experiencing character) or undirected.

² 1742 overall, but following Kim and Klinger (2019) we use only the subset annotated with a causing character.

Relation Classification: For the second task, we have collected our own dataset. To this end, we used BookNLP on all books of the Harry Potter series to extract all interactions as described in Section 3.1. In contrast to the first dataset, we use automatically extracted characters and co-references here. We then manually annotated all pairs of characters for which we found interactions with their relationship, distinguishing between friendly and unfriendly relationships. We collected two sets of independent annotations and, where the two annotators disagreed, collected a third annotation as a tie-breaker. The tie-breaker was given the option to note that there is no (clear) relation between the two characters. This was the case in the third novel for the relation between Harry and Sirius Black (cf. Section 4). Table 1 provides details for the resulting dataset, which we publish for future research.³

³ http://professor-x.de/datasets/harrymotions

3.3 Evaluation

Emotion Classification: We follow the evaluation setup from Kim and Klinger (2019) for emotion classification, who use multiple settings: The dataset (cf. Section 3.2) provides annotations for sets of two, five and eight directed or undirected emotions. Additionally, they define different ways of representing the entities involved in the emotions, where some add a marker to entities or completely mask them (making it impossible for the model to learn that, e.g., Harry always interacts positively with Ron). We describe these schemes briefly in the following and give an example of how sentences would be represented according to each scheme:

• No-indicator: Entities are represented as in the text; the model is directly fed the unmodified sentence (e.g., Alice is angry with Bob).

• Role: Entities are marked as causing or experiencing by surrounding them with marker tokens, one kind of marker for the experiencing character and another for the causing character (e.g., Alice is angry with Bob, with both names wrapped in their respective role markers).

• MRole: Entities are only identified by their role, i.e., the names are replaced by the experiencer and causer markers (the role markers standing in for Alice and Bob in the example above).

• Entity: Entities are marked as entities, with no indication as to whether they cause or experience the emotion (e.g., Alice is angry with Bob, with both names wrapped in entity markers).

• MEntity: Entities are masked by entity markers (the entity markers standing in for Alice and Bob).

Table 2 shows our results in comparison to those from Kim and Klinger (2019), reporting what they define as the story-level F1-score. Our classifier outperforms theirs in most settings, as discussed in Section 4.

Relation Classification: In our second experiment, we use the emotions detected in the previous step to detect overall relationships between characters in the Harry Potter series by aggregating over emotions as described in Section 3.1. In Table 3, we report macro-averaged F1-scores as well as accuracies for aggregating emotions as classified in the Entity and MEntity settings for 2 and 5 emotion classes, since we do not have role labels for the Harry Potter corpus and the emotion classification for 8 emotions did not perform well. Note that the number of emotions only pertains to the emotion classification setting; relations are always classified as friendly or unfriendly. For the 5-class setting, we define anger, disgust and sadness as negative emotions, joy as positive and anticipation as neutral. The parameter α was optimised on hp1 and is set to 0.4, except for 5-MEntity (α = 0.75). Lacking a directly comparable approach, we report sampling from the true label distribution per novel as a baseline (which performs better than majority vote in our setting). We find that, on average, 2 classes lead to better results and that we always outperform the baseline.

4 Discussion

In this section, we discuss our findings along with some of the decisions involved in the dataset collection and provide some insight regarding the development of emotions over the course of a novel.

BERT vs. GRU: Our BERT-based classifier outperforms the GRU in all undirected, but not in all directed settings. Specifically, in the 8-class directed evaluation, the GRU usually performs better than BERT. We hypothesise two possible reasons: a) The rather low amount of training data available for each of the 8 emotion classes, especially in the directed case. We assume that the GRU's lower number of parameters makes it easier to tune on fewer samples. b) BERT is a bi-directional model, while the GRU used here is uni-directional. Since the GRU reads sentences in the right order, while BERT reads in both directions, it might be easier for the GRU to model directed relations.

Dataset Collection: As mentioned in Section 3.2, we excluded some relations during the annotation process. This is due to two reasons: a) errors in named entity recognition and b) changing relationships. For the first category, BookNLP returned the entity "Felix Felicis", which is a luck potion. We excluded all relationships involving the potion, but kept collective entities like "Hogwarts". In the second category, we find the relationship between Sirius Black and most other characters in the third novel. For the majority of the book, Sirius is regarded as a villain intent on killing Harry, which is revealed to be wrong at the end of the novel, turning the relation very positive. Since the label here is unclear, we excluded it from the dataset.

Developing Relations: As described before, relationships can change drastically within a novel. Two prominent examples of this in the Harry Potter novels are the relations between Harry and Hermione in the first novel (where they become friends) and between Harry and Sirius Black in the third novel (see the previous paragraph). We can use the emotions detected by our classifier to plot a trajectory over the novel. The polarity for characters a and b in chapter i is then calculated as p_i = p_{i-1} + pos_{a,b,i} − neg_{a,b,i}, where pos_{a,b,i} and neg_{a,b,i} count the positive and negative emotions between a and b in chapter i, respectively, and p_0 := 0. We show plots for three examples in Figure 1, using predictions from the 2-class MEntity classification. In all cases, the trajectory matches our expectation: For Harry and Hermione, the relation starts neutral with a very clear upward trend after they become friends. For Ron, the relation quickly becomes very positive. For Sirius, the relation is mostly negative, while improving clearly in the final chapters.

5 Conclusion

We have presented an improved approach for the classification of emotional relations between fictional characters. By aggregating sentence-level emotions, we have built a classifier for novel-wide character relations based on emotion analysis.
While our experiments show that aggregation yields promising results, future work includes the development of a stronger classifier for story-level relations. We also plan on investigating the influence of co-reference resolution, which is currently done automatically. Using manual labels or improved co-reference resolution should further improve our results: First experiments indicate better performance for frequent characters, where resolution errors are more easily smoothed out.

                 GRU                           BERT
         Undirected    Directed        Undirected    Directed
Setting  8c  5c  2c   8c  5c  2c      8c  5c  2c   8c  5c  2c
NoInd    33  41  66   25  23  37      34  52  74   21  29  41
Role     19  34  55   33  35  56      19  51  65   21  23  34
MRole    32  44  67   39  44  65      39  59  75   30  55  75
Entity   21  31  57   22  18  30      28  44  70   18  30  46
MEntity  33  46  65   28  30  39      34  55  74   31  36  48

Table 2: Comparison of story-average F1-scores between our classifier (BERT) and the GRU from Kim and Klinger (2019). Results for the GRU are taken from the original paper. The best result for each setting is marked in bold.

               F1-score              Accuracy
         2-class   5-class        2-class   5-class
Novel    En  MEn   En  MEn  Base  En  MEn   En  MEn
hp1*     53  61    64  60   46    55  64    69  61
hp2      57  56    37  52   41    62  59    38  56
hp3      62  60    68  56   45    64  61    70  56
hp4      60  68    56  55   60    72  77    67  64
hp5      62  56    57  62   47    64  58    61  65
hp6      60  58    60  60   46    62  62    63  63
hp7      58  57    63  49   46    62  60    68  50
avg      59  59    58  56   47    63  63    62  59

Table 3: Macro-averaged F1-scores and accuracies for the classification of relations according to different emotion annotation schemes. En refers to the Entity annotation scheme, MEn to the MEntity scheme. Base refers to the stratified random baseline. hp1* was used as a development set to determine the value of α.

Acknowledgements

Many thanks to Darleen Pappelau for helpfully providing the tie-breaker annotations for the dataset.

References

David Bamman, Ted Underwood, and Noah A. Smith. 2014. A Bayesian mixed effects model of literary character. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 370–379.

Snigdha Chaturvedi, Shashank Srivastava, Hal Daumé III, and Chris Dyer. 2016. Modeling dynamic relationships between characters in literary novels. In Thirtieth AAAI Conference on Artificial Intelligence.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Sean Everton, Tara Everton, Aaron Green, Cassie Hamblin, and Rob Schroeder. 2019. Strong ties and where to find them: Or, why Neville (and Ginny and Seamus) and Bellatrix (and Lucius) might be more important than Harry and Tom. SSRN.

Lena Hettinger, Martin Becker, Isabella Reger, Fotis Jannidis, and Andreas Hotho. 2015. Genre classification on German novels. In Proceedings of the 12th International Workshop on Text-based Information Retrieval.

Jumayel Islam, Robert E. Mercer, and Lu Xiao. 2019. Multi-channel convolutional neural network for Twitter emotion and sentiment recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1355–1365, Minneapolis, Minnesota.

Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daumé III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534–1544.

Evgeny Kim and Roman Klinger. 2018. A survey on sentiment and emotion analysis for computational literary studies. Submitted for review to DHQ (http://www.digitalhumanities.org/dhq/).

Evgeny Kim and Roman Klinger. 2019. Frowning Frodo, wincing Leia, and a seriously great friendship: Learning to classify emotional relationships of fictional characters. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 647–653, Minneapolis, Minnesota. Association for Computational Linguistics.

Vincent Labatut and Xavier Bost. 2019. Extraction and analysis of fictional character networks: A survey. ACM Computing Surveys (CSUR), 52(5):1–40.

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 task 4: Sentiment analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 502–518.

J. K. Rowling. 1997. Harry Potter and the Philosopher's Stone, 1st edition, volume 1. Bloomsbury Publishing, London.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Shashank Srivastava, Snigdha Chaturvedi, and Tom Mitchell. 2016. Inferring interpersonal relations in narrative summaries. In Thirtieth AAAI Conference on Artificial Intelligence.

David Vilares and Carlos Gómez-Rodríguez. 2019. Harry Potter and the action prediction challenge from natural language. In Proceedings of NAACL-HLT, pages 2124–2130.

Qianming Xue, Wei Zhang, and Hongyuan Zha. 2020. Improving domain-adapted sentiment classification by deep adversarial mutual learning. Accepted to appear in AAAI'20.

Albin Zehe, Martin Becker, Lena Hettinger, Andreas Hotho, Isabella Reger, and Fotis Jannidis. 2016. Prediction of happy endings in German novels. In Proceedings of the Workshop on Interactions between Data Mining and Natural Language Processing 2016, pages 9–16.
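To make the non-neural parts of the pipeline from Sections 3.1 and 4 concrete, the following Python sketch implements the interaction-window extraction, the α-threshold relation rule and the chapter-wise trajectory. It is a minimal illustration under simplifying assumptions, not the actual system: exact-string character matching stands in for BookNLP co-reference resolution, the per-interaction emotion labels are taken as given instead of being predicted by the fine-tuned BERT model, the 20-mention and 5-interaction filters are omitted, and all function names are ours.

```python
def extract_interactions(tokens, a, b, max_gap=10, context=10):
    """Return text chunks where characters a and b occur with at most
    `max_gap` tokens between them (regardless of sentence boundaries),
    padded with `context` tokens on both sides (cf. Section 3.1)."""
    occ = {name: [i for i, tok in enumerate(tokens) if tok == name]
           for name in (a, b)}
    chunks = []
    for i in occ[a]:
        for j in occ[b]:
            gap = abs(i - j) - 1  # tokens strictly between the two mentions
            if 0 <= gap <= max_gap:
                lo, hi = min(i, j), max(i, j)
                chunks.append(tokens[max(0, lo - context):hi + context + 1])
    return chunks

def classify_relation(labels, alpha=0.4):
    """Aggregate per-interaction emotion labels to one relation label:
    friendly iff the fraction of positive emotions exceeds alpha."""
    pos = sum(1 for lab in labels if lab == "positive")
    return "friendly" if alpha < pos / len(labels) else "unfriendly"

def trajectory(labels_per_chapter):
    """Running polarity p_i = p_{i-1} + pos_i - neg_i with p_0 = 0,
    as used for the trajectories in Figure 1 (cf. Section 4)."""
    p, traj = 0, []
    for labels in labels_per_chapter:
        p += (sum(lab == "positive" for lab in labels)
              - sum(lab == "negative" for lab in labels))
        traj.append(p)
    return traj
```

With α = 0.4, a pair whose interactions are labelled twice "positive", once "negative" and once "neutral" has a positive fraction of 0.5 and is therefore classified as friendly; the trajectory simply accumulates the positive/negative counts chapter by chapter.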