Scene Linking Annotation and Automatic Scene Characterization in TV Series

Aman Berhe1, Camille Guinaudeau1, Claude Barras2
aman.berhe@limsi.fr, camille.guinaudeau@limsi.fr, barras@vocapia.com
1 Université Paris-Saclay, CNRS, LIMSI, 91400, Orsay, France
2 Vocapia Research, 91400, Orsay, France

Abstract

In the context of a large collection of multimedia documents, creating links between documents or scenes can help to organize the collection. For TV series, this organization can be achieved through narrative structure extraction by means of scene linking. Narrative characteristics such as speaking characters, entity mentions and themes can be used to characterize scenes. Scenes can be linked within an episode, across episodes and/or across seasons, since the stories in a TV series progress at different levels of granularity. In this work, we have annotated the links between the scenes of the first two seasons of the TV series Game of Thrones, using predefined stories and sub-stories, and we have automatically extracted the narrative characteristics of each scene. The dataset is composed of 444 scenes involving 154 speaking characters, organized in 46 stories divided into 151 sub-stories and 5 sub-sub-stories.

1 Introduction

Large multimedia collections can be organized by means of scene or document linking. For example, [BVS+17] used cross-modal approaches to link target videos to an anchor in a collection of archived multimedia documents and were able to extract common patterns between the target videos and the anchor. In the context of TV series, organizing a large and complicated series, Game of Thrones for example, as a collection of linked scenes is a way of understanding its narrative.

The creation and diffusion of manual annotations for scene linking, even though difficult and time consuming, is necessary to evaluate automatic techniques and to allow reproducible research. A few works [GLV14, Pro10, EF19] have addressed narrative structure annotation in folk tales, for example Propp's folktales [Pro10] and French folktales [GLV14]. However, to our knowledge, there is no publicly available annotated dataset of scene linking for narrative structure extraction in TV series or, more generally, in multimedia collections.

In this paper, we present a new dataset for the first two seasons of Game of Thrones, composed of scene linking annotations with respect to the narrative structure, together with annotations of the most reportable scenes (MRS), i.e. scenes that bring a drastic change to the lives of the characters and to a story; they are key scenes for connecting other scenes. The dataset (annotations + automatic characterization of scenes) is made publicly available in order to allow reproducible research.

Our paper is organized as follows. First, we present background studies and data on narratives in TV series and books in Section 2. Then, Section 3 describes the scene linking annotation and its importance for narrative structure extraction. In Section 4, we introduce scene characterization with respect to narrative elements. The dataset is described in Section 5 and we conclude in Section 6.
2 Background

Since the 1970s, narratives and storytelling have been investigated to validate scientific methods in the field of artificial intelligence and to understand and evaluate theories of human cognition [Var17, AS90]. The field of computational narrative links daily human activities (narratives) and the computing world (machine computations) by analyzing and modelling narratives, narrative understanding and machine-readable representations of narratives, with the purpose of enabling computers to tell a story.

Annotating the narrative elements of a document is a very time-consuming task. [LCW+17, EF19] have worked on the annotation of narrative elements of short stories in two different ways: [LCW+17] produced a guideline for directly annotating the narrative structure based on Freytag's pyramid [Fre90], while [EF19] provided a guideline for annotating narrative characteristics in order to collect human judgements on them. Garcia-Fernandez et al. [GLV14] digitized and annotated a corpus of tales from a narrative point of view (the only French tales corpus available) and classified it according to the Aarne & Thompson narrative classification.

Multimedia hyperlinking – a way to navigate the information in videos by jumping from one video to another – has been studied extensively [BGJ+17, BVS+17, BDG18, CSR18] using different techniques. [BGP+15, OAB15, KM16, AFM+16, BGJ+17] designed linking categories or typologies for multimedia hyperlinking and built graphs for easily exploring news items or links. Kim et al. [KM16] used narrative theory as a framework to identify links between social media content. [OAB15] presented video hyperlinking based on named entity identification. In TV series, scenes can be linked using the concepts of multimedia hyperlinking [KM16, ESB12, BGJ+17, CSR18], which can tie different videos together and recreate one whole narrative.

3 Scene linking annotation

A link, in scene linking, expresses the relevance between two or more scenes according to the story and the narrative elements they share. Linking scenes from the same episode or from different episodes, and continuing this chain of links until the last episode of a TV series, can capture the narrative structure of the whole series.

Before starting the scene linking annotation, the stories and sub-stories of the TV series are defined based on the main characters' stories and on the overall story of the Game of Thrones TV series. A scene can start a story or a sub-story: for example, a scene of the story of Jon Snow (a character in Game of Thrones) starts with "Jon Snow going to the Wall", which is a sub-story of "Jon Snow as Lord Commander". A scene is assigned to one or more stories and can have several sub-stories inside the assigned stories, depending on the theme the scene focuses on.

Our scene linking annotation is performed in two steps. First, the current scene is linked to the most related scenes that come before it. One scene may be linked to more than one scene: a scene (S3) may start with an event or story that is linked to a scene (S2) and also focus on an event or story that appears in another scene (S1). S3 is then linked to both S2 and S1, while there may be no link between S2 and S1. Second, a story title is given to each scene, based on the predefined stories and sub-stories. The manually annotated stories can have up to three levels of granularity, and the given stories are used as linking categories.

A linking category is a title given to link two or more scenes according to the story they are assigned. From the point of view of the automatic assignment of story titles to scenes, a linking category may be seen as the name of a cluster that groups related scenes together. Furthermore, in each story, we have also annotated the most reportable scenes (MRS), i.e. the scenes that bring a major change to a story or to a situation inside a scene. The sketch below shows what a resulting annotation record can look like.
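For illustration, a single scene's annotation record combining the two steps could look like the following; the field names and values are hypothetical choices of ours, not the schema of the released dataset.

```python
# Hypothetical record for one annotated scene (field names are illustrative,
# not the schema of the released dataset).
scene_annotation = {
    "scene_id": "s02e03_scene_12",              # episode and scene index
    "linked_scenes": ["s02e03_scene_07",        # step 1: most related earlier scenes,
                      "s02e01_scene_22"],       # possibly across episodes or seasons
    "story": "Jon Snow as Lord Commander",      # step 2: predefined story (linking category)
    "sub_story": "Jon Snow going to the Wall",  # up to three levels of granularity
    "is_mrs": False,                            # most reportable scene flag
}
```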
4 Semi-automatic scene characterization

The scene linking annotation process is time consuming, and the assignment of stories to a scene is ambiguous. An automatic tool that characterizes scenes in order to create the links between them is therefore valuable: annotating a scene with narrative characteristics helps to create links between scenes with respect to the narrative structure, especially in the case of complicated TV series. We have thus designed a method to automatically characterize a scene, which is described below.

Before scene characterization, the data of each episode need to be prepared. Video episodes are extracted from DVDs of the TV series, along with their audio tracks and subtitles in several languages (French, English, German, Czech, Spanish, Polish and Hungarian). Manual transcripts of the episodes are scraped from various websites and fan pages of the TV series, with the speaking character name attached to each line of the transcripts. Speaking character names are normalized against the characters' list found on IMDB (https://www.imdb.com/list/ls068919538/) by replacing spaces with "_" and lowercasing all letters; a sketch of this step is given at the end of this section. Finally, the transcripts are force-aligned with the automatically extracted audio using the LIMSI text-to-audio alignment tool [GLA02], which provides the timing of each word in an episode.

Shot segmentation is performed with the shot boundary detection (SBD) algorithm implemented in the open-source pyannote-video toolkit [Bre15], which is based on displaced frame differences (DFD) and uses landmark features of the frames. Scene segmentation is performed by grouping adjacent shots, relying on a combination of multimodal neural features from our previous work [BBG19]. In this work, we define a scene as a set of contiguous shots connected by a central concept or theme. The method detects shot boundaries, computes deep visual features for each frame with the pretrained VGG16 model of [SZ14], embeds the speech (text) inside each shot boundary using a word2vec model built on the subtitles and books of Game of Thrones, and augments these features with the temporal information of each shot. Once the features are computed, the shots are clustered and a sequence grouping algorithm regroups adjacent shots into scene segments.

Finally, each scene is represented by its narrative characteristics or elements: the characters (speaking characters and characters mentioned in the conversations inside a scene), the entity mentions (locations and organizations) and the theme of the scene. Figure 1 depicts the characterization of a scene; the way it is computed is described in Section 4.1.
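The name normalization step can be sketched as follows (the function name and examples are ours; nicknames such as "little finger" additionally require an alias table mapping them to the canonical character name).

```python
# Hypothetical helper reproducing the stated convention: spaces become
# underscores and all letters are lowercased, matching the IMDB character list.
def normalize_character_name(name: str) -> str:
    return name.strip().lower().replace(" ", "_")

# Different fan-site spellings map to one canonical form.
assert normalize_character_name("Petyr Baelish") == "petyr_baelish"
assert normalize_character_name("JON SNOW") == "jon_snow"
```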
4.1 Automatic scene characterization

In order to automatically represent the semantic content of a collection of short documents (in our case, scenes), vector representations of words and documents (word2vec and doc2vec respectively), term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA) are among the best-known and most effective methods. Considering the narrative elements, the automatic scene characterization is performed as follows.

First, we extract the speaking characters (from the manually annotated transcripts) and the entities involved in a scene. Since the transcripts are scraped from different websites that may name the same character differently, the speaking character names are normalized to our standard character naming; e.g. the character nicknamed "little finger" is normalized to "petyr_baelish". There are 154 speaking characters in the first two seasons of Game of Thrones. Since any situation or event revolves around characters or entities, the identified entities serve as connection links between scenes that share the same events or situations, based on the commonly mentioned entities.

Figure 1: Scene Characterization

A state-of-the-art neural named entity recognition (NER) system, Flair [ABV18], is used to extract the name mentions in the dialogues of each scene. Flair performs better than Stanford CoreNLP at detecting entities with longer names; for example, titles like "Robert of the House Baratheon" are detected by Flair but not by CoreNLP.

Then, the main keywords that can represent the theme of a scene are extracted. TF-IDF is used to extract the 10 most representative keywords from the scene text, since it is a simple and efficient method for extracting keywords and capturing the importance of a word in a short text. The first five seasons of Game of Thrones are used as the document collection, with scenes as documents; the collection is thus composed of 1018 scenes and 92970 words. If a scene has fewer than 10 words, all of its words are taken as keywords. Nevertheless, TF-IDF keyword extraction may not capture words that are synonyms, so we added a topic model that assigns a topic, which may capture the main theme, to each scene and makes it possible to relate the topics of different scenes. We use LDA to assign each scene to one topic (the one with the highest membership probability), using the whole TV series as the corpus and the scenes as documents.

Finally, a document-to-vector (Doc2Vec) embedding is used to represent the scenes' transcripts. Each scene is treated as a document and is represented by a 100-dimensional vector using Doc2Vec [LM14]. This representation can then be used to compute the semantic similarity of scenes, considering that scenes that tell the same stories have a high content similarity. Minimal sketches of the entity, keyword, topic and embedding steps are given below.
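Entity extraction with Flair can be sketched as follows, assuming the standard pretrained English NER model; this is a sketch, not the exact configuration used in our experiments, and attribute names may vary slightly across Flair versions.

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the pretrained English 4-class NER tagger (PER, LOC, ORG, MISC).
tagger = SequenceTagger.load("ner")

def scene_entities(scene_text: str):
    """Return (mention, label) pairs found in a scene's dialogues."""
    sentence = Sentence(scene_text)
    tagger.predict(sentence)
    return [(span.text, span.tag) for span in sentence.get_spans("ner")]

print(scene_entities("Robert of the House Baratheon rides to Winterfell."))
```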
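The keyword step can be sketched with scikit-learn's TfidfVectorizer; the implementation choice is our assumption, as the paper does not name one. Scenes with fewer scored words than k naturally keep all of them, matching the rule for scenes with fewer than 10 words.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(scene_texts, k=10):
    """Return the k highest-TF-IDF words of each scene (scenes as documents)."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(scene_texts)
    vocab = np.array(vectorizer.get_feature_names_out())
    keywords = []
    for row in tfidf:
        scores = row.toarray().ravel()
        best = scores.argsort()[::-1][:k]
        # Keep only words that actually occur in the scene.
        keywords.append([vocab[i] for i in best if scores[i] > 0])
    return keywords
```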
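Topic assignment with gensim's LdaModel might look like the following; the number of topics and the training parameters are our assumptions, as the paper does not report them.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def assign_topics(scene_texts, num_topics=20):  # num_topics is an assumption
    """Train LDA on scenes-as-documents and keep the top topic per scene."""
    tokens = [text.lower().split() for text in scene_texts]
    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                   passes=10, random_state=0)
    # For each scene, keep the topic with the highest membership probability.
    return [max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
            for bow in corpus]
```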
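The embedding step, with the paper's vector size of 100, can be sketched with gensim's Doc2Vec; hyper-parameters other than the vector size are our assumptions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def embed_scenes(scene_texts):
    """Embed each scene transcript as a 100-dimensional Doc2Vec vector."""
    docs = [TaggedDocument(words=text.lower().split(), tags=[i])
            for i, text in enumerate(scene_texts)]
    return Doc2Vec(docs, vector_size=100, min_count=2, epochs=40)

# Scenes telling the same story should get similar vectors, e.g.
# embed_scenes(texts).dv.most_similar(0, topn=5) lists the five scenes
# closest to scene 0 (the .dv attribute follows gensim 4.x).
```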
5 Description of the annotation

Our data focus on the Game of Thrones TV series, one of the most complicated and popular TV series; its complexity comes from the number of stories and from the number of characters and their intertwined stories.

We have annotated the first two seasons of Game of Thrones by assigning stories from the predefined stories for the purpose of linking scenes, even though almost all stories eventually converge into one story. The annotation was performed by a single annotator and took around one hour and thirty minutes per episode; it is available at https://github.com/aman-berhe/Game-of-Thrones-Dataset. The annotated dataset has 444 scenes (the 87 scenes that do not contain speech are ignored in the characterization step), organized in 46 main stories and 151 sub-stories. The largest story is composed of 76 scenes and the smallest of only 1 scene; the main stories have between 1 and 11 sub-stories. The number of sub-stories increases along the seasons of Game of Thrones, and some new stories are also created. Figure 2 illustrates the average number of stories (resp. sub-stories) per scene, while Figure 3 shows the average number of speaking characters per scene. Most scenes contain only 2 speaking characters; the maximum number of speaking characters in a scene is 8 and the minimum is 1 (unknown characters, for example "soldier#1", are removed, as they have no value for scene linking).

Figure 2: The average number of stories and sub-stories per scene

Figure 3: The average number of speaking characters per scene

Over the two seasons of Game of Thrones, the dataset has 444 scenes with an average scene length of 123.23 seconds, a total of 92970 words and 70 most reportable scenes (MRS). Story-wise, the dataset has 197 predefined stories (main stories and sub-stories), with an average of 3 MRS per story.

6 Conclusion

Scene linking, as a way to construct the narrative structure, is a good way of representing the very complicated and long connected stories found in TV series and similar documents. The complication in TV series is that all the scene stories feed into the main story of the series. As the manual annotation is quite difficult and time consuming, more sophisticated automatic annotation could be developed, beyond what we have proposed, for better scene characterization and linking. Clustering techniques can be applied to the dataset to create links between scenes. In particular, soft or fuzzy clustering can assign an element of a document to more than one cluster; applied to scenes, it can place a scene in multiple clusters, which in turn can capture the membership of a scene in multiple stories (see the sketch below).
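One way to realize this soft clustering idea (our illustration, not an experiment reported in the paper) is a Gaussian mixture over the Doc2Vec scene vectors, whose posterior probabilities act as soft story memberships; the number of components mirrors the 46 main stories, and the membership threshold is arbitrary.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def soft_story_clusters(scene_vectors, n_stories=46, threshold=0.2):
    """Soft-cluster scene embeddings; each row of the result gives one
    scene's membership probabilities over the story clusters."""
    gmm = GaussianMixture(n_components=n_stories, covariance_type="diag",
                          random_state=0)
    gmm.fit(scene_vectors)
    memberships = gmm.predict_proba(scene_vectors)   # (n_scenes, n_stories)
    # Scenes whose probability mass spreads over several components can be
    # read as belonging to several stories.
    multi_story = np.where((memberships > threshold).sum(axis=1) > 1)[0]
    return memberships, multi_story
```

Diagonal covariances keep the mixture tractable given that there are only a few hundred scenes in 100 dimensions.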
Additionally, the most reportable scenes (MRS) can be used as connection centers for the scenes that come before and after them. In the near future, we plan to work on automatic MRS detection and on fuzzy clustering of scenes. Different document-to-vector representations, beyond the Doc2Vec model, can also be used to represent the scenes' text.

Acknowledgements

This research was partly funded by the French National Research Agency (ANR) through the PLUMCOT project (ANR-16-CE92-0025) and by the Digiteo StoryArcs project.

References

[Luc68] Lucas, Donald W. Aristotle: Poetics. University of Michigan Press, 1968.
[TW69] Todorov, Tzvetan and Weinstein, Arnold. Structural Analysis of Narrative. Novel: A Forum on Fiction, 70–76, 1969.
[Fre90] Freytag, Gustav. Die Technik des Dramas. S. Hirzel, 1890.
[Pro10] Propp, Vladimir. Morphology of the Folktale. University of Texas Press, 2010.
[Lea70] Leach, Edmund. Claude Lévi-Strauss (Modern Masters series, edited by Frank Kermode). University of Chicago Press, 1970.
[KM16] Kim, Joy and Monroy-Hernandez, Andres. Storia: Summarizing Social Media Content Based on Narrative Theory Using Crowdsourcing. ACM Conference on Computer-Supported Cooperative Work & Social Computing, 1018–1027, 2016.
[EBS+11] Ercolessi, Philippe and Bredin, Hervé and Sénac, Christine and Joly, Philippe. Segmenting TV Series into Scenes Using Speaker Diarization. Workshop on Image Analysis for Multimedia Interactive Services, 13–15, 2011.
[BGJ+17] Bois, Rémi and Gravier, Guillaume and Jamet, Eric and Morin, Emmanuel and Robert, Maxime and Sébillot, Pascale. Linking Multimedia Content for Efficient News Browsing. ACM International Conference on Multimedia Retrieval, 301–307, 2017.
[CSR18] Chaturvedi, Snigdha and Srivastava, Shashank and Roth, Dan. Where Have I Heard This Story Before? Identifying Narrative Similarity in Movie Remakes. NAACL-HLT, 673–678, 2018.
[Var17] Valls-Vargas, Josep. Narrative Information Extraction with Non-Linear Natural Language Processing Pipelines. PhD thesis, Drexel University, 2017.
[AS90] Andersen, Sandy and Slator, Brian M. Requiem for a Theory: The "Story Grammar" Story. Journal of Experimental & Theoretical Artificial Intelligence, 253–275, 1990.
[BBG19] Berhe, Aman and Barras, Claude and Guinaudeau, Camille. Video Scene Segmentation of TV Series Using Multimodal Neural Features. Series - International Journal of TV Serial Narratives, 59–68, 2019.
[ABV18] Akbik, Alan and Blythe, Duncan and Vollgraf, Roland. Contextual String Embeddings for Sequence Labeling. COLING, 1638–1649, 2018.
[ESB12] Ercolessi, Philippe and Sénac, Christine and Bredin, Hervé. Toward Plot De-interlacing in TV Series Using Scenes Clustering. Content-Based Multimedia Indexing (CBMI), 1–6, 2012.
[LCW+17] Li, Boyang and Cardier, Beth and Wang, Tong and Metze, Florian. Annotating High-Level Structures of Short Stories and Personal Anecdotes. LREC, 2017.
[EF19] Eisenberg, Joshua and Finlayson, Mark. Annotation Guideline No. 1: Cover Sheet for Narrative Boundaries Annotation Guide. Journal of Cultural Analytics, 11199, 2019.
[Bos16] Bost, Xavier. A Storytelling Machine? Automatic Video Summarization: The Case of TV Series. PhD thesis, Université d'Avignon, 2016.
[GLV14] Garcia-Fernandez, Anne and Ligozat, Anne-Laure and Vilnat, Anne. Construction and Annotation of a French Folktale Corpus. LREC, 2430–2435, 2014.
[Ash87] Ashliman, Dee L. A Guide to Folktales in the English Language: Based on the Aarne-Thompson Classification System. Greenwood Press, New York, 1987.
[BVS+17] Bois, Rémi and Vukotić, Vedran and Simon, Anca-Roxana and Sicre, Ronan and Raymond, Christian and Sébillot, Pascale and Gravier, Guillaume. Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity. Multimedia Modeling, 185–197, 2017.
[Bre15] Bredin, Hervé. pyannote-video: a toolkit for shot detection, shot threading and face tracking. https://github.com/pyannote/pyannote-video, 2015.
[SZ14] Simonyan, Karen and Zisserman, Andrew. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.
[BDG18] Budnik, Mateusz and Demirdelen, Mikail and Gravier, Guillaume. A Study on Multimodal Video Hyperlinking with Visual Aggregation. IEEE International Conference on Multimedia and Expo (ICME), 1–6, 2018.
[BGP+15] Bois, Rémi and Gravier, Guillaume and Sébillot, Pascale and Morin, Emmanuel. Vers une typologie de liens entre contenus journalistiques. TALN, 2015.
[OAB15] Ordelman, Roeland and Aly, Robin and Huet, Benoit. Convenient Discovery of Archived Video Using Audiovisual Hyperlinking. Workshop on Speech, Language & Audio in Multimedia, 23–26, 2015.
[AFM+16] Awad, George and Fiscus, Jonathan and Michel, Martial and Smeaton, Alan F. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TREC Video Retrieval Evaluation (TRECVID), 2016.
[LM14] Le, Quoc and Mikolov, Tomas. Distributed Representations of Sentences and Documents. International Conference on Machine Learning (ICML), 1188–1196, 2014.
[GLA02] Gauvain, Jean-Luc and Lamel, Lori and Adda, Gilles. The LIMSI Broadcast News Transcription System. Speech Communication, 89–108, 2002.