Distant Viewing of the Harry Potter Movies via Computer Vision

Alina El-Keilany, Thomas Schmidt and Christian Wolff
Media Informatics Group, University of Regensburg, Germany

The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022.
Email: alina.el-keilany@stud.uni-regensburg.de (A. El-Keilany); thomas.schmidt@ur.de (T. Schmidt); christian.wolff@ur.de (C. Wolff)
Homepage: https://www.uni-regensburg.de/sprache-literatur-kultur/medieninformatik/sekretariat-team/thomas-schmidt/index.html (T. Schmidt); https://www.uni-regensburg.de/sprache-literatur-kultur/medieninformatik/sekretariat-team/christian-wolff/index.html (C. Wolff)
ORCID: 0000-0001-7171-8106 (T. Schmidt); 0000-0001-7278-8595 (C. Wolff)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We present an exploratory study performing distant viewing via computer vision methods in the genre of fantasy movies. As a case study, we use 10 modern fantasy movies of the Harry Potter franchise (also referred to as the Wizarding World franchise). We apply methods and state-of-the-art models for color and brightness analysis, object detection, location classification as well as facial emotion recognition. We present descriptive results as well as inferential statistics. Furthermore, we discuss the results and the quality of the methods for this unique use case and give examples. Our statistical analysis revealed significant differences in the results of the methods across the movies: the movies of the Harry Potter series become darker, and negative emotional expressions on faces become more frequent.

Keywords
computer vision, film studies, distant viewing, harry potter, object detection, emotion recognition

1. Introduction

Digital film analysis has gained a lot of interest and popularity in digital humanities (DH) in recent years. Although movies are a multimodal medium, research often focuses on one specific modality. A lot of research uses the text channel, as it is more accessible and the methods are more established in DH [1, 2, 3]. However, due to advances in machine learning and computer vision (CV), scholars have also started investigating the visual image channel of movies, for example, to analyze shot lengths [4], colors [5, 6, 7], contrast [8] or sentiment [9, 10]. Current CV methods offer possibilities beyond such basic visual parameters, for example object detection, which has been used in DH for various tasks [11, 12, 13, 14], and emotion recognition, which has been used in theater studies [15]. For a larger overview of potential CV methods and tools, we recommend the survey paper by Pustu-Iren et al. [16]. To give this research branch a theoretical grounding, Arnold and Tilton [11] defined the term "distant viewing" for this kind of computational quantitative analysis of movies and other video material in DH.

In this paper, we extend previous work on five case studies [14] and present a project in the line of distant viewing research for the specific case study of modern "fantasy" movies, more precisely 10 cinema movies of the Wizarding World (Harry Potter) franchise. We selected various popular CV methods and applied them to the movies: color and brightness analysis, object detection, location classification and emotion recognition. Our approach is predominantly exploratory.
We investigate whether these methods uncover characteristics of the movies that can be validated statistically and whether we can identify diachronic developments across the movies based on the metrics produced by the CV methods (similar to research on websites by [17]). By doing so, we want to reflect upon the advantages, disadvantages and limitations of the specific methods for digital film studies and on which methods to pursue in further research.

2. Corpus and Preprocessing

The movie corpus for our analysis consists of the ten released movies of the Wizarding World franchise, comprising the two subseries Harry Potter and Fantastic Beasts. The Harry Potter series is based on J.K. Rowling's books of the same title and follows the eponymous Harry Potter, a student at Hogwarts School of Witchcraft and Wizardry, on his coming-of-age journey and his fight against the main antagonist Voldemort. In 2016, a new series in the Wizarding World franchise began with Fantastic Beasts and Where to Find Them and continued with the release of Fantastic Beasts: The Crimes of Grindelwald in 2018. In general, the movies are prototypical for the fantasy genre. The titles, short titles and abbreviations (as we use them in this paper), release years, directors as well as the runtimes of the movies are shown in table 1.

| Title | Year | Director | Runtime (mins.) | Frames |
|---|---|---|---|---|
| Harry Potter and the Philosopher's Stone (Harry Potter 1; HP1) | 2001 | Chris Columbus | 152 | 8,293 |
| Harry Potter and the Chamber of Secrets (Harry Potter 2; HP2) | 2002 | Chris Columbus | 161 | 8,606 |
| Harry Potter and the Prisoner of Azkaban (Harry Potter 3; HP3) | 2004 | Alfonso Cuarón | 142 | 7,465 |
| Harry Potter and the Goblet of Fire (Harry Potter 4; HP4) | 2005 | Mike Newell | 157 | 8,270 |
| Harry Potter and the Order of the Phoenix (Harry Potter 5; HP5) | 2007 | David Yates | 138 | 7,388 |
| Harry Potter and the Half-Blood Prince (Harry Potter 6; HP6) | 2009 | David Yates | 153 | 8,278 |
| Harry Potter and the Deathly Hallows – Part 1 (Harry Potter 7; HP7) | 2010 | David Yates | 146 | 7,750 |
| Harry Potter and the Deathly Hallows – Part 2 (Harry Potter 8; HP8) | 2011 | David Yates | 130 | 6,791 |
| Fantastic Beasts and Where to Find Them (Fantastic Beasts; FB1) | 2016 | David Yates | 133 | 7,147 |
| Fantastic Beasts: The Crimes of Grindelwald (Fantastic Beasts 2; FB2) | 2018 | David Yates | 134 | 7,204 |

Table 1: General information on the movie corpus. We extracted one frame per second for each movie.

All movies have a frame rate of 25 frames per second, with each frame having 32 bits per sample and a 720x576 resolution. The technical prerequisites of all CV methods are met. As a sample for our analysis, we regard one frame per second of each movie. We therefore extracted a single frame for every second of a movie, keeping its temporal integrity while drastically reducing the amount of data we process (a minimal sketch of this sampling step is shown at the end of this section). We regard this sample as sufficient and representative of the movies. The number of frames we effectively worked with is presented in the column "frames" in table 1. Overall, we collected 77,192 frames, which we will refer to as the corpus.

Concerning the results, we will first present descriptive data and then inferential statistics via significance tests for the methods for which we gathered numeric data. As significance test, we performed a one-way Welch's ANOVA, except for one setting with nominal data for which we use Pearson's chi-squared test. Our data meets all necessary requirements for these tests. We speak of significant differences for p < 0.05 and refer to Cohen [18] to interpret the effect sizes in the case of ANOVAs. Cohen defines η² > 0.01 as a weak, > 0.06 as a moderate and > 0.14 as a strong effect. Furthermore, while we did not perform rigorous systematic evaluations, we will report our general impression of the quality of the methods.
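The following sketch illustrates how such a one-frame-per-second sample can be extracted with OpenCV. It is a minimal illustration under the setup described above, not the original extraction script; file and directory names are placeholders.

```python
import cv2

def extract_frames(video_path, out_dir):
    """Save one frame per second of playback time as an image file."""
    capture = cv2.VideoCapture(video_path)
    fps = int(round(capture.get(cv2.CAP_PROP_FPS))) or 25  # 25 fps for the movies in our corpus
    frame_idx = saved = 0
    while True:
        success, frame = capture.read()
        if not success:
            break
        # keep only the first frame of every second of playback
        if frame_idx % fps == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.png", frame)
            saved += 1
        frame_idx += 1
    capture.release()
    return saved

# extract_frames("HP1.mp4", "frames/hp1")  # hypothetical file names
```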
3. Color Analysis

3.1. Approach

We analyzed the movies' visual parameters color and brightness using OpenCV [19]. For the color analysis, we focus on "movie barcodes", a method already applied for color analysis in digital film studies [5, 6, 7]. To get an average color value for each frame, we imported the frames as arrays of RGB values and calculated a mean value for all three color channels over all pixels. These mean color values extracted per frame can be utilized to visualize the movies by generating a so-called "movie barcode", in which each frame is represented by a vertical line of its mean color [5] (a minimal sketch of this computation is given at the end of this section). The barcodes can be used to view the movies from a distance and let us perceive the diachronic progression of colors across a movie and across multiple movies.

3.2. Results

The movie barcodes highlight distinctive scenes in terms of color usage (fig. 1): for example, the light blue strips in the middle of HP3 consist of scenes set in winter; the large field of light in the otherwise rather dark HP8 is due to a specific scene in which Harry Potter spends time in a state of limbo. The movie barcodes show a lot of warm brown and beige tones in the first two Harry Potter movies as well as larger areas of dark blue, green and cyan colors in HP3. Overall, the movies tend to get darker and less colorful, which is in line with the plot of the movies getting more serious and less light-hearted.

Figure 1: "Movie barcodes" for all movies (HP1-HP8, FB1-FB2, from top to bottom).

Reflecting upon the benefits of this method, we conclude that movie barcodes offer an interesting analysis method for the overall style and presentation of a movie. However, a limitation is that the analysis remains rather qualitative, consisting of the interpretation of the barcodes, a process that is always prone to subjectivity.
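A minimal sketch of the barcode construction is given below, assuming the sampled frames are available as image files; the brightness value used in the next section can be derived from the same arrays. This is an illustration of the described computation, not the original analysis code.

```python
import cv2
import numpy as np

def mean_color(frame):
    """Average color over all pixels of a frame (OpenCV loads images in BGR order)."""
    return frame.reshape(-1, 3).mean(axis=0)

def brightness(frame):
    """Mean grayscale value scaled to [0, 1]; 0 is solid black, 1 solid white (see Section 4)."""
    return cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).mean() / 255.0

def movie_barcode(frame_paths, height=200):
    """One vertical line per frame, colored with that frame's mean color."""
    columns = [np.tile(mean_color(cv2.imread(p)), (height, 1, 1)) for p in frame_paths]
    return np.concatenate(columns, axis=1).astype(np.uint8)

# barcode = movie_barcode(frame_paths)  # frame_paths: sorted list of extracted frame files
# cv2.imwrite("barcode_hp1.png", barcode)
```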
4. Brightness Analysis

4.1. Approach

We calculated the brightness value for each frame by converting it to a grayscale image and then calculating the mean value over all pixels, representing the image's brightness on a scale of 0 to 1 (with 0 being a solid black and 1 a solid white image).

| movie | mean | SD | median | max |
|---|---|---|---|---|
| HP1 | 0.19 | 0.12 | 0.17 | 0.91 |
| HP2 | 0.15 | 0.08 | 0.13 | 0.98 |
| HP3 | 0.16 | 0.12 | 0.13 | 0.96 |
| HP4 | 0.13 | 0.09 | 0.11 | 0.85 |
| HP5 | 0.12 | 0.09 | 0.10 | 0.97 |
| HP6 | 0.10 | 0.09 | 0.07 | 0.87 |
| HP7 | 0.10 | 0.08 | 0.07 | 0.81 |
| HP8 | 0.13 | 0.16 | 0.07 | 0.96 |
| FB1 | 0.15 | 0.10 | 0.13 | 0.97 |
| FB2 | 0.14 | 0.10 | 0.12 | 0.94 |
| Overall | 0.14 | 0.11 | 0.11 | 0.98 |

Table 2: Descriptive statistics for brightness for each movie and overall. Mean is the average across all frames, SD is the standard deviation, max the maximum.

4.2. Results

Table 2 summarizes the statistics for the brightness values. The highest brightness value can be found for HP1 (M = 0.19) and the lowest for HP7 (M = 0.10). Indeed, the brightness becomes consistently lower throughout the series. The Fantastic Beasts movies are of average brightness. We performed a one-way Welch's ANOVA to assess the significance of the differences among the movies and received a significant result (F = 680.21, p < 0.001). Post-hoc tests (using Holm correction to adjust p) showed the largest effects for the differences between HP1 and HP6 (η² = 0.15) and between HP1 and HP7 (η² = 0.16), which are strong effects according to Cohen [18]. These results are in line with the plot of the movies becoming more serious and darker. Thus, with the brightness analysis and the inferential statistics, we show that more recent movies differ significantly from the older movies on this metric, although the absolute values are rather similar. Consequently, we see brightness analysis as a beneficial method for digital film studies. However, it is hard to point to specific scenes and frames since the summarized value is an overall calculation over all frames of a movie. Nevertheless, we can also look for maximum values to find interesting stylistic scenes and frames for in-depth analysis (e.g. fig. 2).

Figure 2: Frame with the highest brightness value in HP3 (0.94).

5. Object Detection

5.1. Approach

Object detection is the task of predicting object classes and their positions in images. We performed object detection with the Detectron2 API (https://github.com/facebookresearch/detectron2) [20], which is regarded as state of the art for object detection. We used a Mask R-CNN model pretrained on the well-known COCO dataset [21]. The model can predict 80 common everyday objects like cars, animals or furniture. The predictor takes frames as input and delivers the detected objects, their respective location masks, and the confidence of each prediction on a scale of 0 to 1. We set the threshold for the confidence score of a prediction to 0.5. This rather low value allows for an exploratory assessment of the results while cutting off the model's most uncertain predictions. To compare the movies regarding the objects occurring in them, we counted the objects for every frame and summed up the total number of occurrences for each object over all frames. Additionally, we calculated the percentage of frames an object is detected in.
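A minimal sketch of this prediction step with the Detectron2 model zoo is shown below; the chosen config is one of several available pretrained Mask R-CNN variants and stands in as an example for the model we used, and the frame path is a placeholder.

```python
import cv2
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data import MetadataCatalog
from detectron2.engine import DefaultPredictor

# Load a Mask R-CNN model pretrained on COCO from the Detectron2 model zoo.
config_name = "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(config_name))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(config_name)
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # confidence threshold used in our study
# cfg.MODEL.DEVICE = "cpu"  # uncomment to run without a GPU
predictor = DefaultPredictor(cfg)

class_names = MetadataCatalog.get(cfg.DATASETS.TRAIN[0]).thing_classes

frame = cv2.imread("frames/hp1/frame_000123.png")  # hypothetical frame path
instances = predictor(frame)["instances"]
for class_id, score in zip(instances.pred_classes.tolist(), instances.scores.tolist()):
    print(class_names[class_id], round(score, 2))
```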
5.2. Results

To analyze the results of this method, we focus on frequency distributions across movies. Tables 3, 4 and 5 illustrate the 10 most frequently detected objects for each movie and overall. The overall impression is that the distributions are rather homogeneous. Persons are the most frequent objects in all movies by a wide margin, which is likely a general characteristic of movies (fig. 3). Other common objects are ties (as they are part of the school uniforms), chairs and books. Objects that uncover specific characteristics of the movies are rare, except for the suitcase object in FB1 (see table 5). This object appears with high frequency in this movie since it belongs to the protagonist and is an important part of the plot in general.

Figure 3: Frame with the most detected persons according to the object detection (HP2).

We did not perform exact evaluations, but we analyzed the detection results heuristically by scanning through multiple examples across all movies. We gained the impression that the detection of persons and furniture works quite accurately. However, we did identify problems with objects in the movies that are not part of the COCO class set. For example, many of the detected animals are actually fantasy creatures for which (of course) no predefined class exists in the model used (fig. 4). On the other hand, we also identified false classifications for objects that are in the model but not actually part of the movies, like wands being classified as smartphones. While all these problems are understandable, we conclude that the method of object detection has its greatest potential when the models are adapted to the unique domain of a movie genre so that they cover the objects that are important for the specific genre.

Figure 4: Frame with a fantasy creature "falsely" classified as cat by the object detection (FB1).

| HP1 | HP2 | HP3 | HP4 |
|---|---|---|---|
| person (23,138; 88.0%) | person (23,839; 87.5%) | person (22,842; 82.1%) | person (31,750; 83.8%) |
| tie (2,941; 19.5%) | tie (3,508; 25.8%) | tie (2,536; 15.6%) | tie (2,693; 17.3%) |
| chair (824; 6.6%) | book (2,459; 4.2%) | chair (1,140; 9.7%) | chair (598; 5.7%) |
| book (820; 2.3%) | chair (1,050; 8.6%) | book (586; 3.8%) | book (307; 1.9%) |
| cup (492; 3.3%) | vase (356; 3.2%) | bottle (498; 4.3%) | handbag (299; 3.3%) |
| dining table (297; 2.7%) | cup (325; 2.8%) | bird (430; 4.3%) | bottle (244; 1.8%) |
| bird (284; 2.1%) | dog (298; 3.3%) | dining table (430; 4.1%) | horse (219; 2.4%) |
| horse (284; 3.2%) | bottle (230; 2.1%) | cup (396; 3.9%) | cup (201; 1.7%) |
| wine glass (245; 2.1%) | handbag (223; 2.4%) | horse (320; 0.3%) | dog (195; 2.3%) |
| vase (211; 2.2%) | dining table (218; 2.1%) | wine glass (302; 2.4%) | wine glass (193; 1.7%) |

Table 3: Distribution of top 10 detected objects for each movie (part 1). Each cell lists the object, its absolute frequency (#) and the percentage of frames containing the specific object at least once (%).

| HP5 | HP6 | HP7 | HP8 |
|---|---|---|---|
| person (25,984; 87.9%) | person (19,377; 82.3%) | person (17,762; 84.9%) | person (21,739; 83.7%) |
| tie (3,491; 23.5%) | book (1,716; 4.3%) | chair (1,612; 9.6%) | tie (1,315; 8.5%) |
| chair (1,026; 9.5%) | tie (1,637; 14.1%) | book (1,257; 3.2%) | chair (328; 3.9%) |
| cup (749; 5.6%) | cup (1,210; 6.5%) | tie (906; 9.2%) | book (207; 1.4%) |
| bottle (644; 3.5%) | chair (1,144; 9.2%) | dining table (479; 3.3%) | handbag (155; 2.2%) |
| book (326; 3.4%) | bowl (634; 4.0%) | bottle (338; 2.5%) | bottle (148; 1.3%) |
| handbag (315; 3.4%) | wine glass (603; 4.3%) | cup (274; 2.6%) | horse (135; 1.7%) |
| dining table (302; 3.3%) | dining table (589; 5.4%) | bird (265; 1.0%) | cup (112; 1.6%) |
| vase (301; 3.3%) | vase (466; 4.6%) | wine glass (252; 1.7%) | dog (108; 1.2%) |
| wine glass (269; 2.6%) | bottle (442; 3.9%) | car (229; 1.3%) | wine glass (90; 0.7%) |

Table 4: Distribution of top 10 detected objects for each movie (part 2).
| FB1 | FB2 | Overall |
|---|---|---|
| person (21,070; 87.6%) | person (20,961; 86.2%) | person (228,462; 85.4%) |
| tie (3,850; 35.6%) | tie (3,611; 29.6%) | tie (26,488; 19.8%) |
| chair (947; 10.2%) | chair (1,730; 11.7%) | chair (10,399; 8.4%) |
| book (713; 4.1%) | book (634; 3.2%) | book (9,025; 3.2%) |
| cup (456; 3.8%) | bottle (390; 3.0%) | cup (4,481; 3.4%) |
| handbag (297; 3.6%) | dining table (263; 2.4%) | bottle (3,295; 2.5%) |
| suitcase (277; 3.4%) | cup (230; 2.5%) | dining table (3,021; 2.9%) |
| bottle (270; 2.1%) | handbag (227; 2.8%) | wine glass (2,469; 2.1%) |
| wine glass (220; 1.6%) | dog (221; 2.7%) | vase (2,317; 2.5%) |
| dining table (200; 2.5%) | vase (209; 2.4%) | handbag (2,295; 2.6%) |

Table 5: Distribution of top 10 detected objects for each movie (part 3) and overall.

6. Location Classification

6.1. Approach

Location classification (also often called place or scene classification) does not refer to the geographical location of an image but to the overall setting which an image depicts, e.g. a forest, an indoor room, a street etc. To detect locations and the setting of a scene, we used places365 (https://github.com/CSAILVision/places365), which offers a residual neural network (ResNet) pretrained on the Places2 dataset (http://places2.csail.mit.edu/) [22]. The ResNet can predict 365 location categories, including rather exotic ones like "airfields" or "zen gardens", based on what the overall image resembles the most. The 365 classes are structured in a hierarchical order, distinguishing between indoor and outdoor at the highest level. Using the model on preprocessed images yields the most likely location as well as the prediction confidence on a scale of 0 to 1. The default mode of the location classifier is to assign every image the most likely location, but the probabilities of these predictions are often very low. Therefore, we introduced a threshold of 0.7 to keep only rather certain predictions of the model. This resulted in 14,263 classified frames (18.5% of all frames). For each movie, we summed up the number of times a location is predicted and calculated the percentage of frames it is detected in. Additionally, we categorized each frame into the groups indoor and outdoor, using the model's 5 most likely predictions and majority voting.
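A minimal sketch of this classification step, following the publicly available demo code of the places365 repository, is shown below. The checkpoint and category files are assumed to have been downloaded from the project page; their names may differ between releases, and the frame path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# ResNet-18 weights and category list as distributed by the places365 project (assumed file names).
model = models.resnet18(num_classes=365)
checkpoint = torch.load("resnet18_places365.pth.tar", map_location="cpu")
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}
model.load_state_dict(state_dict)
model.eval()

# Each line looks like "/a/airfield 0"; keep only the readable category name.
classes = [line.strip().split(" ")[0][3:] for line in open("categories_places365.txt")]

preprocess = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("frames/hp3/frame_000123.png")  # hypothetical frame path
with torch.no_grad():
    logits = model(preprocess(image).unsqueeze(0))
probs = torch.nn.functional.softmax(logits, dim=1)[0]
top_prob, top_idx = probs.topk(5)  # 5 most likely locations, e.g. for the indoor/outdoor voting
for p, i in zip(top_prob.tolist(), top_idx.tolist()):
    print(f"{classes[i]}: {p:.2f}")
```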
6.2. Results

First, table 6 presents the distribution of frames classified as rather indoor or outdoor for all movies.

| movie | indoor | outdoor |
|---|---|---|
| HP1 | 78.9% | 21.1% |
| HP2 | 84.2% | 15.8% |
| HP3 | 60.5% | 39.5% |
| HP4 | 75.1% | 24.9% |
| HP5 | 82.8% | 17.2% |
| HP6 | 85.2% | 14.8% |
| HP7 | 70.2% | 29.8% |
| HP8 | 75.4% | 24.6% |
| FB1 | 73.7% | 26.3% |
| FB2 | 74.8% | 25.2% |
| Overall | 76.3% | 23.7% |

Table 6: Distribution of frames classified as predominantly indoor or outdoor for each movie and overall.

We can consistently identify that the majority of frames across all movies are classified as indoor. This is in line with the content of the movies, which usually take place inside a castle. We performed a Pearson's chi-squared test, which showed significant differences between the movies (χ² = 243.9, p < 0.001). The effect size measured by Cramér's V (0.13) shows a weak effect [18]. We can see that the indoor percentage decreases for the last two movies, which makes sense plot-wise since the main characters travel throughout these movies. However, the large outdoor percentage for HP3 is mostly due to misclassifications. This movie is shot with a lot of blue lighting and effects for artistic reasons, which are constantly misclassified as underwater (see fig. 5).

Figure 5: Frame falsely classified as underwater (HP3).

Tables 7, 8 and 9 illustrate the distribution of the subcategories across all movies. Similar to the object detection, the distribution is overall homogeneous. However, the detected classes are often rather exotic. The frequent jail classifications are surprising: while some scenes are indeed set in jails, most of these classifications are due to the lattice-like windows of the Hogwarts castle in which most of the movies take place (see fig. 6).

Figure 6: Frame falsely classified as jail cell (HP8).

While many classifications are understandable, the method suffers from the fact that the model is trained on photographs of real-world places and not on movie material. Close-up shots pose a lot of challenges to the model due to the missing surroundings and landscapes. In future work, we intend to separate these shots from wide shots that include landscapes in order to focus on the more reliably classified frames.

| HP1 | HP2 | HP3 | HP4 |
|---|---|---|---|
| jail cell (235; 21.1%) | jail cell (498; 37.8%) | jail cell (557; 50.1%) | jail cell (375; 27.7%) |
| catacomb (229; 20.6%) | catacomb (291; 22.1%) | catacomb (68; 6.1%) | catacomb (348; 25.7%) |
| nursing home (62; 5.6%) | archive (72; 5.5%) | elevator shaft (65; 5.8%) | aquarium (116; 8.6%) |
| aquarium (55; 4.9%) | pub/indoor (63; 4.8%) | ocean deep (50; 4.5%) | ocean deep (73; 5.4%) |
| stage/indoor (45; 4.0%) | elevator shaft (51; 3.9%) | aquarium (41; 3.7%) | discotheque (59; 4.4%) |
| staircase (41; 3.7%) | aquarium (31; 2.4%) | sky (37; 3.3%) | elevator shaft (40; 3.0%) |
| elevator shaft (39; 3.5%) | bookstore (29; 2.2%) | train interior (24; 2.2%) | sky (39; 2.9%) |
| conference center (36; 3.2%) | sky (21; 1.6%) | staircase (21; 1.9%) | auditorium (20; 1.5%) |
| archive (23; 2.1%) | hospital room (20; 1.5%) | hospital room (20; 1.8%) | throne room (18; 1.3%) |
| sky (22; 2.0%) | slum (15; 1.1%) | crevasse (18; 1.6%) | staircase (18; 1.3%) |

Table 7: Distribution of top 10 detected locations for each movie (part 1). Each cell lists the location, its absolute frequency (#) and the percentage of frames it is detected in (%).

| HP5 | HP6 | HP7 | HP8 |
|---|---|---|---|
| jail cell (374; 31.8%) | catacomb (1,113; 51.1%) | jail cell (515; 34.7%) | jail cell (858; 51.1%) |
| discotheque (149; 12.7%) | jail cell (762; 35.0%) | catacomb (396; 26.7%) | catacomb (518; 30.9%) |
| catacomb (125; 10.6%) | archive (61; 2.8%) | basement (113; 7.6%) | elevator shaft (92; 5.5%) |
| pub/indoor (76; 6.5%) | elevator shaft (52; 2.4%) | bamboo forest (78; 5.3%) | church/indoor (34; 2.0%) |
| aquarium (71; 6.0%) | sky (30; 1.4%) | elevator shaft (45; 3.0%) | aquarium (23; 1.4%) |
| elevator shaft (53; 4.5%) | alley (17; 0.8%) | sky (33; 2.2%) | sky (19; 1.1%) |
| stage/indoor (44; 3.7%) | igloo (15; 0.7%) | campsite (32; 2.2%) | staircase (14; 0.8%) |
| underwater/ocean deep (30; 2.6%) | aquarium (14; 0.6%) | elevator/door (26; 1.8%) | escalator/indoor (12; 0.7%) |
| medina (28; 2.4%) | stable (12; 0.6%) | wheat field (21; 1.4%) | subway station/platform (11; 0.7%) |
| playground (24; 2.0%) | cemetery (11; 0.5%) | alley (20; 1.3%) | basement (9; 0.5%) |

Table 8: Distribution of top 10 detected locations for each movie (part 2).
| FB1 | FB2 | Overall |
|---|---|---|
| jail cell (487; 39.8%) | jail cell (914; 56.0%) | jail cell (5,575; 39.1%) |
| catacomb (142; 11.6%) | catacomb (210; 12.9%) | catacomb (3,440; 24.1%) |
| bamboo forest (55; 4.5%) | sky (60; 3.7%) | elevator shaft (548; 3.8%) |
| elevator shaft (54; 4.4%) | elevator shaft (57; 3.5%) | aquarium (435; 3.0%) |
| pub/indoor (41; 3.3%) | aquarium (39; 2.4%) | sky (309; 2.2%) |
| sauna (32; 2.6%) | medina (39; 2.4%) | discotheque (272; 1.9%) |
| bank vault (30; 2.5%) | igloo (33; 2.0%) | pub/indoor (247; 1.7%) |
| aquarium (28; 2.3%) | throne room (26; 1.6%) | archive (175; 1.2%) |
| sky (27; 2.2%) | burial chamber (24; 1.5%) | underwater/ocean deep (171; 1.2%) |
| igloo (26; 2.1%) | crevasse (18; 1.1%) | crevasse (20; 1.3%) |

Table 9: Distribution of top 10 detected locations for each movie (part 3) and overall.

7. Emotion Recognition

7.1. Approach

Emotion recognition is the task of detecting emotions on human faces. It is employed in various use cases in computer science [23, 24, 25, 26] but, to the best of our knowledge, rarely applied to the image channel in DH [15]; there, emotion analysis is predominantly performed on text, e.g. plays [27, 28, 29] or social media content [30, 31]. We used the Python module FER (https://pypi.org/project/fer; https://github.com/justinshenk/fer) [32] to recognize the characters' emotions. In a first step, the faces must be detected. We used a multitask cascaded convolutional network (MTCNN; [33]) and the Haar cascade face detection algorithm proposed by Viola and Jones [34]. For the emotion analysis, we used a CNN trained on the FER-2013 dataset [32] that can predict the seven emotional categories anger, disgust, fear, happiness, neutral, sadness and surprise. For every face, multiple emotions can be predicted simultaneously in varying proportions, summing up to 1. If more than one face was detected in a frame, we calculated a mean value for the emotions. Additionally, we assigned the highest-scoring emotion as the dominant emotion of a frame, which allows us to explore what the most dominant emotion is for every movie.
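A minimal sketch of the face and emotion detection step with the FER package is given below. The per-frame aggregation mirrors the description above; the frame path is a placeholder, and this is an illustration rather than the original analysis script.

```python
import cv2
import numpy as np
from fer import FER

detector = FER(mtcnn=True)  # MTCNN face detection; mtcnn=False falls back to Haar cascades

frame = cv2.imread("frames/hp8/frame_000123.png")  # hypothetical frame path
faces = detector.detect_emotions(frame)  # one dict per detected face: bounding box + emotion scores

if faces:
    # average the seven emotion scores over all faces detected in the frame ...
    emotions = {e: float(np.mean([f["emotions"][e] for f in faces]))
                for e in faces[0]["emotions"]}
    # ... and take the highest-scoring emotion as the frame's dominant emotion
    dominant = max(emotions, key=emotions.get)
    print(dominant, emotions)
```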
7.2. Results

For the statistical analysis, we averaged the means over all frames of a movie on which at least one face is detected to get an overall value. Furthermore, we calculated the percentage of frames having a specific emotion as maximum value of all emotions (on the same set of frames). In tables 10 and 11, we present the results. The generally low mean values are due to the fact that the emotion classes often have a value of 0, since the values need to sum up to 1.

| emotion | HP1 | HP2 | HP3 | HP4 | HP5 | HP6 |
|---|---|---|---|---|---|---|
| angry | 0.16 (11.4%) | 0.16 (10.2%) | 0.17 (12.7%) | 0.21 (18.6%) | 0.15 (8.5%) | 0.19 (14.2%) |
| disgust | 0.00 (0.1%) | 0.00 (0.1%) | 0.00 (0.0%) | 0.00 (0.0%) | 0.00 (0.0%) | 0.00 (0.0%) |
| fear | 0.11 (5.1%) | 0.11 (4.7%) | 0.10 (2.9%) | 0.10 (2.5%) | 0.10 (2.5%) | 0.09 (2.0%) |
| happy | 0.10 (9.0%) | 0.09 (7.9%) | 0.11 (7.8%) | 0.12 (10.1%) | 0.10 (8.1%) | 0.09 (8.2%) |
| neutral | 0.23 (24.3%) | 0.25 (27.8%) | 0.24 (27.2%) | 0.18 (15.2%) | 0.27 (31.4%) | 0.24 (24.5%) |
| sad | 0.29 (41.2%) | 0.30 (43.2%) | 0.31 (45.3%) | 0.32 (49.2%) | 0.31 (46.6%) | 0.32 (47.6%) |
| surprise | 0.10 (8.8%) | 0.08 (6.1%) | 0.06 (4.1%) | 0.07 (4.4%) | 0.06 (3.0%) | 0.07 (3.4%) |

Table 10: Results for the emotion recognition across all movies (part 1). Each cell gives the mean, i.e. the average of this emotion across all frames with detected faces, and in parentheses the proportion of frames with this specific emotion as maximum value across all these frames.

| emotion | HP7 | HP8 | FB1 | FB2 | Overall |
|---|---|---|---|---|---|
| angry | 0.18 (10.9%) | 0.21 (17.5%) | 0.18 (14.0%) | 0.19 (15.8%) | 0.18 (12.9%) |
| disgust | 0.00 (0.0%) | 0.01 (0.4%) | 0.00 (0.0%) | 0.00 (0.0%) | 0.00 (0.1%) |
| fear | 0.09 (2.0%) | 0.10 (3.9%) | 0.12 (5.5%) | 0.11 (3.3%) | 0.10 (3.6%) |
| happy | 0.07 (4.7%) | 0.08 (5.1%) | 0.09 (7.4%) | 0.07 (5.9%) | 0.09 (7.5%) |
| neutral | 0.20 (17.6%) | 0.20 (19.2%) | 0.23 (24.9%) | 0.21 (21.9%) | 0.23 (24.1%) |
| sad | 0.37 (59.7%) | 0.34 (50.7%) | 0.30 (42.7%) | 0.33 (47.9%) | 0.32 (46.9%) |
| surprise | 0.08 (5.1%) | 0.07 (3.3%) | 0.08 (5.5%) | 0.08 (5.1%) | 0.08 (5.0%) |

Table 11: Results for the emotion recognition across all movies (part 2) and overall.

Overall, we identified sadness as the most frequently detected emotion (46.9%) (see fig. 7 for an example), followed by neutral (24.1%) and anger (12.9%). The sadness proportion increases over the course of the Harry Potter series, reaching its maximum in HP7 (59.7%), while the happiness value decreases. Again, this points to the increasing dramatic seriousness of the plot throughout the movies. The emotion disgust was rarely detected. We performed a Welch's ANOVA and found that, indeed, the differences among the movies are significant for each emotion class (p < 0.001). However, the effect size is rather small for most emotions (η² < 0.01), except for angry with a moderate effect (η² = 0.02). Nevertheless, this shows that the movies are rather homogeneous concerning their emotional tone.

Analyzing the results, we found that the emotion detection generally works quite well. However, the face detection has problems dealing with faces that are not looking directly towards the camera (fig. 8). This is due to the fact that the training material of the model predominantly consists of frontal faces. We conclude that the face detection needs domain adaptation for the complex camera angles that movies consist of.

Figure 7: Frame with a maximum sad value (HP8).

Figure 8: Frame with a face not looking directly towards the camera and thus not detected by the face detection (HP8).

8. Discussion

One of our research goals was to identify whether the applied methods can uncover specific characteristics of and differences across the movies. Indeed, the color and brightness analysis showed descriptive differences and, in the case of brightness, differences that could be supported by significance tests. The movies of the Harry Potter series tend to get darker. These general visual results are in line with the results of the emotion analysis, which also shows an increase in sadness classifications and a decrease of the average happiness value. However, for most of the other methods, we found significant differences but with rather low effect sizes. Most of the methods behaved rather homogeneously across the movies, with some punctual exceptions. One reason for this might be that the movies belong to the same series, franchise and genre. Therefore, the stylistic and content-based differences might be too small to become apparent via these kinds of methods. We want to explore this assumption in future work by conducting case studies with movies of different genres and decades.

We did not perform an exact evaluation. We plan to do so in future work for some of the methods by systematically evaluating a subset of the corpus against human-made annotations to get a precise overview of the quality of the methods. However, we did sporadically explore the quality of the results while conducting our research. While we do think that all of the methods work surprisingly well in many cases, mistakes and misclassifications are not rare. Many of these problems are connected to the fantasy genre, and the behavior of the models is understandable. We conclude that this is the main general challenge of this research: the CV methods are not intended for artistic movies and therefore need domain adaptation, which is possible and has been a common research branch in machine learning in recent years. However, domain adaptation needs large amounts of correctly annotated frames, which is very resource-intensive and challenging, as it is for similar narrative content like plays [35, 36].
Nevertheless, we intend to pursue this process by starting annotation studies for one of the most promising methods, object detection, which we will then use to train and extend general-purpose models for the specific use case of fantasy movies. Despite the problems, we could show that many of the methods offer a lot of possibilities for large-scale distant viewing research in digital film studies. We see great potential in combining the methods to explore correlations, for example whether certain locations in genre-based movies appear more frequently together with specific objects. At the same time, we also see potential in analyzing diachronic developments or in comparing different genres via CV methods.

References

[1] E. Hoyt, K. Ponto, C. Roy, Visualizing and Analyzing the Hollywood Screenplay with ScripThreads, Digital Humanities Quarterly 008 (2014).
[2] A. Hołobut, J. Rybicki, The Stylometry of Film Dialogue: Pros and Pitfalls, Digital Humanities Quarterly 014 (2020).
[3] J. Byszuk, The Voices of Doctor Who – How Stylometry Can be Useful in Revealing New Information About TV Series, Digital Humanities Quarterly 014 (2020).
[4] M. Baxter, D. Khitrova, Y. Tsivian, Exploring cutting structure in film, with applications to the films of D. W. Griffith, Mack Sennett, and Charlie Chaplin, Digital Scholarship in the Humanities 32 (2017) 1–16. URL: https://doi.org/10.1093/llc/fqv035. doi:10.1093/llc/fqv035.
[5] M. Burghardt, K. Hafner, L. Edel, S.-L. Kenaan, C. Wolff, An information system for the analysis of color distributions in moviebarcodes, in: M. Gäde (Ed.), Everything changes, everything stays the same? Understanding information spaces: Proc. 15th Int. Symp. of Information Science (ISI 2017), Berlin, Germany, 13th-15th March 2017, volume 70 of Schriften zur Informationswissenschaft, Verlag Werner Hülsbusch, Glückstadt, 2017, pp. 356–358. URL: https://epub.uni-regensburg.de/35682/.
[6] B. Flueckiger, G. Halter, Methods and Advanced Tools for the Analysis of Film Colors in Digital Humanities, Digital Humanities Quarterly 014 (2020).
[7] N. Redfern, Colour palettes in US film trailers: a comparative analysis of movie barcode, Umanistica Digitale (2021) 251–270. URL: https://umanisticadigitale.unibo.it/article/view/12468. doi:10.6092/issn.2532-8816/12468.
[8] J. Pause, N.-O. Walkowski, Dead and Beautiful: The Analysis of Colors by Means of Contrasts in Neo-Zombie Movies, Digital Humanities 2017. Conference Abstracts (2017).
[9] T. Schmidt, D. Halbhuber, Live sentiment annotation of movies via arduino and a slider, in: Digital Humanities in the Nordic Countries 5th Conference 2020 (DHN 2020). Late Breaking Poster, 2020. URL: https://epub.uni-regensburg.de/49300/.
[10] T. Schmidt, I. Engl, D. Halbhuber, C. Wolff, Comparing live sentiment annotation of movies via arduino and a slider with textual annotation of subtitles, in: DHN Post-Proceedings, 2020, pp. 212–223. URL: https://epub.uni-regensburg.de/50811/.
[11] T. Arnold, L. Tilton, Distant viewing: analyzing large visual corpora, Digital Scholarship in the Humanities (2019). URL: https://doi.org/10.1093/digitalsh/fqz013. doi:10.1093/digitalsh/fqz013.
[12] G. Howanitz, B. Bermeitinger, E. Radisch, S. Gassner, M. Rehbein, S. Handschuh, Deep Watching - Towards New Methods of Analyzing Visual Media in Cultural Studies, 2019. doi:10.5281/zenodo.3326470.
[13] T. Schmidt, S. Kurek, Der Einsatz von Computer Vision-Methoden für Filme - Eine Fallanalyse für die Kriminalfilm-Reihe Tatort, in: DHd 2022 Kulturen des digitalen Gedächtnisses. 8. Tagung des Verbands "Digital Humanities im deutschsprachigen Raum" (DHd 2022), Potsdam, Germany, 2022. URL: https://zenodo.org/record/6328167. doi:10.5281/zenodo.6328167.
[14] T. Schmidt, A. El-Keilany, J. Eger, S. Kurek, Exploring Computer Vision for Film Analysis: A Case Study for Five Canonical Movies, in: 2nd International Conference of the European Association for Digital Humanities (EADH 2021), Krasnoyarsk, Russia, 2021. URL: https://epub.uni-regensburg.de/50867/. doi:10.5283/epub.50867.
[15] T. Schmidt, C. Wolff, Exploring Multimodal Sentiment Analysis in Plays: A Case Study for a Theater Recording of Emilia Galotti, in: Proceedings of the Conference on Computational Humanities Research 2021 (CHR 2021), Amsterdam, The Netherlands, 2021, pp. 392–404. URL: http://ceur-ws.org/Vol-2989/short_paper45.pdf.
[16] K. Pustu-Iren, J. Sittel, R. Mauer, O. Bulgakowa, R. Ewerth, Automated Visual Content Analysis for Film Studies: Current Status and Challenges, Digital Humanities Quarterly 014 (2020).
[17] T. Schmidt, A. Mosiienko, R. Faber, J. Herzog, C. Wolff, Utilizing html-analysis and computer vision on a corpus of website screenshots to investigate design developments on the web, Proceedings of the Association for Information Science and Technology 57 (2020) e392. URL: https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/pra2.392. doi:10.1002/pra2.392.
[18] J. Cohen, Statistical power analysis for the behavioral sciences, 2nd ed., L. Erlbaum Associates, Hillsdale, N.J., 1988.
[19] G. Bradski, The OpenCV Library, Dr. Dobb's Journal: Software Tools for the Professional Programmer 25 (2000). URL: https://elibrary.ru/item.asp?id=4934581.
[20] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, R. Girshick, Detectron2, 2019. URL: https://github.com/facebookresearch/detectron2.
[21] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft COCO: Common Objects in Context, arXiv:1405.0312 [cs] (2015). URL: http://arxiv.org/abs/1405.0312.
[22] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 Million Image Database for Scene Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (2018) 1452–1464. URL: https://ieeexplore.ieee.org/document/7968387/. doi:10.1109/TPAMI.2017.2723009.
[23] A.-M. Ortloff, L. Güntner, M. Windl, T. Schmidt, M. Kocur, C. Wolff, SentiBooks: Enhancing audiobooks via affective computing and smart light bulbs, in: Proceedings of Mensch und Computer 2019, MuC'19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 863–866. URL: https://doi.org/10.1145/3340764.3345368. doi:10.1145/3340764.3345368.
[24] D. Halbhuber, J. Fehle, A. Kalus, K. Seitz, M. Kocur, T. Schmidt, C. Wolff, The mood game - how to use the player's affective state in a shoot'em up avoiding frustration and boredom, in: Proceedings of Mensch und Computer 2019, MuC'19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 867–870. URL: https://doi.org/10.1145/3340764.3345369. doi:10.1145/3340764.3345369.
[25] P. Hartl, T. Fischer, A. Hilzenthaler, M. Kocur, T. Schmidt, AudienceAR - utilising augmented reality and emotion tracking to address fear of speech, in: Proceedings of Mensch und Computer 2019, MuC'19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 913–916. URL: https://doi.org/10.1145/3340764.3345380. doi:10.1145/3340764.3345380.
[26] T. Schmidt, M. Schlindwein, K. Lichtner, C. Wolff, Investigating the relationship between emotion recognition software and usability metrics, i-com 19 (2020) 139–151. URL: https://doi.org/10.1515/icom-2020-0009. doi:10.1515/icom-2020-0009.
[27] T. Schmidt, K. Dennerlein, C. Wolff, Emotion Classification in German Plays with Transformer-based Language Models Pretrained on Historical and Contemporary Language, in: Proceedings of the 5th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, Association for Computational Linguistics, Punta Cana, Dominican Republic (online), 2021, pp. 67–79. URL: https://aclanthology.org/2021.latechclfl-1.8. doi:10.18653/v1/2021.latechclfl-1.8.
[28] T. Schmidt, K. Dennerlein, C. Wolff, Using Deep Learning for Emotion Analysis of 18th and 19th Century German Plays, in: M. Burghardt, L. Dieckmann, T. Steyer, P. Trilcke, N.-O. Walkowski, J. Weis, U. Wuttke (Eds.), Fabrikation von Erkenntnis. Experimente in den Digital Humanities, 2021. doi:10.26298/melusina.8f8w-y749-udlf.
[29] T. Schmidt, K. Dennerlein, C. Wolff, Towards a Corpus of Historical German Plays with Emotion Annotations, in: D. Gromann, G. Sérasset, T. Declerck, J. P. McCrae, J. Gracia, J. Bosque-Gil, F. Bobillo, B. Heinisch (Eds.), 3rd Conference on Language, Data and Knowledge (LDK 2021), volume 93 of Open Access Series in Informatics (OASIcs), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 2021, pp. 9:1–9:11. doi:10.4230/OASIcs.LDK.2021.9.
[30] T. Schmidt, P. Hartl, D. Ramsauer, T. Fischer, A. Hilzenthaler, C. Wolff, Acquisition and analysis of a meme corpus to investigate web culture, in: Digital Humanities Conference 2020 (DH 2020), Ottawa, Canada, 2020. URL: https://epub.uni-regensburg.de/49294/. doi:10.17613/mw0s-0805.
[31] T. Schmidt, F. Kaindl, C. Wolff, Distant reading of religious online communities: A case study for three religious forums on reddit, in: DHN, Riga, Latvia, 2020, pp. 157–172. URL: http://ceur-ws.org/Vol-2612/paper11.pdf.
[32] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang, Y. Bengio, Challenges in Representation Learning: A report on three machine learning contests, arXiv:1307.0414 [cs, stat] (2013). URL: http://arxiv.org/abs/1307.0414.
[33] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks, IEEE Signal Processing Letters 23 (2016) 1499–1503. doi:10.1109/LSP.2016.2603342.
[34] P. Viola, M. J. Jones, Robust Real-Time Face Detection, International Journal of Computer Vision 57 (2004) 137–154. URL: https://doi.org/10.1023/B:VISI.0000013087.49260.fb. doi:10.1023/B:VISI.0000013087.49260.fb.
[35] T. Schmidt, B. Winterl, M. Maul, A. Schark, A. Vlad, C. Wolff, Inter-rater agreement and usability: A comparative evaluation of annotation tools for sentiment annotation, in: C. Draude, M. Lange, B. Sick (Eds.), INFORMATIK 2019: 50 Jahre Gesellschaft für Informatik – Informatik für Gesellschaft (Workshop-Beiträge), Gesellschaft für Informatik e.V., Bonn, 2019, pp. 121–133. doi:10.18420/inf2019_ws12.
[36] T. Schmidt, M. Burghardt, K. Dennerlein, C. Wolff, Sentiment annotation for Lessing's plays: Towards a language resource for sentiment analysis on German literary texts, in: T. Declerck, J. P. McCrae (Eds.), 2nd Conference on Language, Data and Knowledge (LDK 2019), 2019, pp. 45–50. URL: http://ceur-ws.org/Vol-2402/paper9.pdf.