Breaking the Narrative: Scene Segmentation through Sequential Sentence Classification

Murathan Kurfalı                      Mats Wirén
Department of Linguistics             Department of Linguistics
Stockholm University                  Stockholm University
Stockholm, Sweden                     Stockholm, Sweden
murathan.kurfali@ling.su.se           mats.wiren@ling.su.se

Abstract

In this paper, we describe our submission to the Shared Task on Scene Segmentation (STSS). The shared task requires participants to segment novels into coherent segments, called scenes. We approach this as a sequential sentence classification task and offer a BERT-based solution with a weighted cross-entropy loss. According to the results, the proposed approach performs relatively well on the task, as our model ranks first and second in the official in-domain and out-of-domain evaluations, respectively. However, the overall low performance (0.37 F1-score) suggests that there is still much room for improvement.

1 Introduction

Scene segmentation is a novel task introduced in Zehe et al. (2021a) that aims to divide long narrative texts, e.g. novels, into smaller coherent segments, called scenes. Scenes, in this context, can be roughly defined as "a segment of a text where the story time and the discourse time are more or less equal, the narration focuses on one action and space and character constellations stay the same" (Zehe et al., 2021a).[1] The task of scene segmentation is of great value on several ends: (i) it can be directly employed in several digital humanities tasks, e.g. plot reconstruction; (ii) segmenting longer texts into smaller coherent pieces helps other NLP tasks, e.g. coreference resolution, that struggle with texts longer than a couple of paragraphs (Joshi et al., 2020); (iii) as a novel task that requires high-level modeling of long texts, it offers itself as a valuable probing task for evaluating language models in long-context scenarios, which is an active research area (Tay et al., 2020).

Our main interest in the current paper is to explore whether scene segmentation can be handled as a sequential sentence classification task. To this end, we follow the methodology proposed in Cohan et al. (2019), which encodes all sentences in a sequence jointly through BERT (Devlin et al., 2019) to directly leverage the contextual information from all tokens in the sequence at the same time. The model of Cohan et al. (2019) is further adapted to the task via the introduction of a weighted cross-entropy loss in order to account for the imbalanced distribution of the labels in the dataset.

According to the official results, our model achieves the best performance on the in-domain texts, significantly outperforming the second-ranking system. However, the performance drops when evaluated on out-of-domain novels, suggesting that the proposed methodology generalizes only poorly across domains. We release our system to facilitate reproducibility and future work.[2]

2 System Overview

2.1 Task Details

The scene segmentation task can be framed in several ways. Within the shared task, it is defined as the identification of the boundaries that delimit the consecutive segments (Zehe et al., 2021b). The boundaries between segments are labeled according to the types of segments they delimit. Specifically, a boundary can belong to one of the following three classes: Scene-Scene, Nonscene-Scene, and Scene-Nonscene.[3] The participating teams are evaluated only according to their success at finding and labeling these boundaries.

[1] Interested readers are referred to the annotation guidelines available at https://zenodo.org/record/4457177 for further details.
[2] https://github.com/MurathanKurfali/scene_segmentation
[3] Unlike Scenes, Nonscenes are naturally not distinguished from one another; hence, Nonscene-Nonscene is not a valid transition.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
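To make the label space concrete, the mapping from a gold segmentation to the three boundary classes can be sketched as follows. This is our own illustration, not the official shared-task code; the `(segment_id, segment_type)` encoding and the function name are assumptions made for the example.

```python
# Sketch: derive boundary (transition) labels from a gold segmentation.
# Each sentence is annotated with the segment it belongs to, encoded here
# as a (segment_id, segment_type) pair (our own illustrative encoding).

def transition_labels(sentences):
    """For each pair of adjacent sentences, emit the transition type when a
    new segment starts, and None when both belong to the same segment."""
    labels = []
    for (prev_id, prev_type), (curr_id, curr_type) in zip(sentences, sentences[1:]):
        if prev_id == curr_id:
            labels.append(None)  # not a boundary
        else:
            # One of "Scene-Scene", "Scene-Nonscene", "Nonscene-Scene"
            labels.append(f"{prev_type}-{curr_type}")
    return labels

# Two scenes separated by a short non-scene passage:
gold = [(0, "Scene"), (0, "Scene"),
        (1, "Nonscene"),
        (2, "Scene"), (2, "Scene")]
print(transition_labels(gold))
# -> [None, 'Scene-Nonscene', 'Nonscene-Scene', None]
```

Note that Nonscene-Nonscene never occurs under this scheme, since adjacent non-scene material belongs to a single Nonscene segment.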
That is to say, the classification of an individual sentence as belonging to a Scene or a Nonscene means very little in the evaluations. Of the three possible transitions, Scene-Scene is the most common one, as Nonscenes are significantly less frequent in the data (see Table 1).

2.2 Our Model

We model scene segmentation as a Sequential Sentence Classification (SSC) task, where the goal is to determine whether a given sentence is segment-initial or not, along with the type of segment it belongs to. Similarly to the more common token classification tasks, e.g. POS tagging or NER, we employ the IOB2 format and assign a tag to each sentence. Specifically, we label segment-initial sentences (boundaries) as #X-B and all other sentences merely as #X, where X indicates the type of segment (Scene or Nonscene).

Our classifier closely follows the methodology proposed in Cohan et al. (2019). Here, the authors employed BERT to perform several document-level classification tasks, e.g. abstract sentence classification, where the aim is to classify the sentences of a scientific abstract into their rhetorical roles, such as introduction, method, etc. The rest of this section describes the model along with our modifications.

The proposed methodology follows the standard way of using BERT through fine-tuning on the target task but uses a novel input representation. The classifier used in the experiments is illustrated in Figure 1. As input, a sequence of N sentences is concatenated by BERT's special delimiter token [SEP], yielding one long sequence. This sequence, after the insertion of the standard [CLS] token at the beginning, is fed into BERT. However, unlike the standard way of using the [CLS] token as the representation of the input sequence, the representations of the individual [SEP] tokens are used as the representations of the sentences that precede them. Hence, instead of the [CLS] token, the [SEP] representations are classified by a multi-layer feedforward network to obtain the labels. The rationale for using [SEP] as the sentence representation has to do with the next-sentence objective of BERT: "Intuitively, through BERT's pretraining, the [SEP] tokens learn sentence structure and relations between continuous sentences" (Cohan et al., 2019). During fine-tuning, the model is further primed to assign appropriate weights to the [SEP] tokens to encode the contextual information necessary for classification. Fine-tuning BERT in this way has the benefit of simultaneously leveraging the contextual information from all sentences in the sequence.

Figure 1: Overview of the system architecture. Each sentence is represented by the respective [SEP] token, which is used to predict the label. Figure copied from Cohan et al. (2019).

Loss function. The model is trained to minimize the cross-entropy loss between the probabilities over the possible labels, computed using a softmax activation, and the target distribution. However, during the initial experiments, we observed that the model severely suffered from the highly skewed label distribution, namely the low number of boundary sentences in comparison to non-boundary ones.[4] In order to mitigate this issue, following previous studies (Rotsztejn et al., 2018; Cui et al., 2019; Yang et al., 2019), we introduce a weighting factor into the loss function, where each class is assigned a weight that is inversely proportional to its frequency in the training set:

    weight_c = ( Σ_i freq(i) ) / freq(c)

where freq indicates the count of a certain class. Overall, the weighted cross-entropy becomes

    Loss_WCE = − Σ_{c=1}^{C} w_c · t_c · log(s_c)

where w_c is the weight, t_c is the gold target value (taking either 0 or 1), and s_c is the corresponding softmax probability of class c.

[4] The most frequent label, Scene, single-handedly accounts for 96.1% of the training data.

3 Experimental Setup

3.1 Data

The dataset used in this shared task is based on an expanded version of the annotation effort introduced in Zehe et al. (2021a) and consists of 20 German novels in total, excluding the blind test sets used in the official evaluations. During model development, we create custom development and test sets by randomly allocating one file for each, using the remaining 18 files for training.[5] The statistics regarding the training/dev/test splits used during model development are provided in Table 1.

Split      | Scene                                   | Non-scene
           | Count  |Avg. Segment|  |Avg. Sent|      | Count  |Avg. Segment|  |Avg. Sent|
Training   | 1075   45.33          10.58            | 51     16.11          15.39
Dev        | 127    69.5           12.12            | 7      5.6            18.20
Test       | 46     38.16          8.05             | 7      19.14          11.15

Table 1: Characteristics of the train/dev/test splits used in model development. The columns give the number of segments, the average size of a segment (in number of sentences), and the average length of a sentence (in number of words), for Scenes and Non-scenes separately.

3.2 Parameter Setting

We follow the implementation of Cohan et al. (2019).[6] As the language model, we use the large German BERT model of Chan et al. (2020) (dubbed GBERT-large[7]) due to its superior performance over the existing German models. A batch size of 8 and gradient accumulation over 4 steps are used to reach an effective batch size of 32. All experiments are run on a single V100 GPU. We set the learning rate to 5e-6, and training is run for a maximum of 100 epochs with early stopping (patience = 20) based on the performance on the development set. Due to BERT's inherent sequence length limit, we set a threshold of 25 sentences per sequence, chosen empirically (i.e., according to the performance on the in-house test set) from the set {20, 25, 30, 50}.

[5] Files 9783740941093 and 9783732522033 are used as the dev and test set, respectively.
[6] https://github.com/allenai/sequential_sentence_classification
[7] https://huggingface.co/deepset/gbert-large

4 Results and Discussion

The official evaluation is performed on two different test sets:

i. Test suite 1 focuses on in-domain evaluation and consists of 5 annotated dime novels.
ii. Test suite 2 focuses on out-of-domain evaluation and consists of 2 annotated contemporary high-literature texts.

Table 2 presents the breakdown of our results for each possible transition, whereas the official ranking of the participating systems, according to the mean micro-averaged F1 scores, is provided in Table 3. According to the official rankings, the proposed approach is good at segmenting in-domain novels and outperforms the second-best system by some margin. However, the performance drops significantly when evaluated on out-of-domain novels, suggesting that the system generalizes poorly across domains.

According to Table 2, our model is best at recognizing Scene-to-Scene transitions; however, it is almost completely incapable of finding the borders between non-scenes and scenes. As suggested by the high-recall, low-precision scores, our model tends to over-segment the novels. On average, the system divides the in-domain novels into 1.76 and the out-of-domain novels into 1.61 times more segments than the gold standard. This tendency towards over-segmentation hints at an over-sensitivity to certain markers, which is further discussed in the next section.

Overall, the results clearly demonstrate that the task is extremely challenging even in the in-domain setting. The poor performance of solutions based on contextual embeddings[8] highlights the need for novel architectures. One obvious drawback of BERT-based models is their inability to encode long sequences. Hence, a straightforward extension of the current model would be to employ a model which supports longer contexts, e.g. Longformer (Beltagy et al., 2020); however, such a model is unfortunately not available for German at the time of writing.

[8] A BERT-based baseline in the original resource paper similarly fails on this task (Zehe et al., 2021a).

                    | In-domain              | Out-of-domain
                    | Prec.  Rec.  F1-score  | Prec.  Rec.  F1-score
Scene-to-Scene      | 0.31   0.64  0.42      | 0.14   0.26  0.19
Scene-to-Nonscene   | 0.08   0.06  0.07      | 0.00   0.00  0.00
Nonscene-to-Scene   | 0.00   0.00  0.00      | 0.00   0.00  0.00
Micro average       | 0.29   0.51  0.37      | 0.14   0.22  0.17
Macro average       | 0.13   0.23  0.16      | 0.05   0.09  0.06
Weighted average    | 0.25   0.51  0.33      | 0.12   0.22  0.16

Table 2: Official results of our submission (precision, recall and F1-score) for each type of transition, along with the averaged results.

Rank  Track 1  Track 2
1.    0.37*    0.26
2.    0.16     0.17*
3.    0.07     0.12
4.    0.02     0.11
5.    -        0.04

Table 3: Official rankings and results of all participating systems, according to the micro-averaged F1 scores, averaged over all the novels in the corresponding suite. The results of our submission are marked with an asterisk.

5 Error Analysis

In addition to the official evaluation, we performed a manual error analysis of our model's predictions on the in-house test set (see Section 3.1). One observation was that in certain cases, although the model correctly recognized the type of transition (e.g. Scene-Scene), it misplaced the boundary by a single sentence. An instance of this can be seen in Example (1), where the predicted boundary appears just before sentence (1a), whereas the gold boundary appears just before the subsequent sentence (1b):[9]

(1) a. Und bald darauf fuhr der Wagen aus dem Wald und einen allmählich ansteigenden Berg hinan.
       (And soon afterwards the car drove out of the forest and up a gradually rising mountain.)
    b. Dort oben lag das Schloss Treuenfels.
       (Treuenfels Castle was up there.)

Furthermore, as mentioned in the previous section, the system tends to over-segment the novels. A manual inspection of the false positives (sentences that are erroneously identified as segment boundaries) reveals that, despite being incorrect, these predictions are not completely random. Most of the false positives involve an adverbial or another kind of phrase which signals a shift in time and/or place. Some cherry-picked examples are given in Examples (2)-(5):[10]

(2) Als das Gefährt das Bergplateau erreicht hatte, ließ der Fahrer einige Male laut die Hupe ertönen.
    (When the vehicle had reached the mountain plateau, the driver sounded the horn loudly a few times.)

(3) Eines Abends, als Graf Harro von einer Herrengesellschaft zeitiger nach Hause kam, als man erwartete, fand er seine Gattin in einer sehr zärtlichen Stellung mit dem jungen Prinzen.
    (One evening, when Count Harro came home from a gentlemen's party earlier than expected, he found his wife in a very affectionate position with the young prince.)

(4) Und am nächsten Morgen fand man die Gräfin Alice tot auf ihrem Lager.
    (And the next morning Countess Alice was found dead in her bed.)

(5) Er wandte sich um und ging wieder zurück, bis in das Zimmer, wo der Schreibtisch der Gräfin Alice stand.
    (He turned around and went back to the room where Countess Alice's desk stood.)

Similar to the behavior of the baseline system proposed in Zehe et al. (2021a), these examples highlight the model's sensitivity to local cues rather than the larger context. That is, to a certain extent, the system makes its predictions according to individual phrases that signal shifts in time or place, paying too little attention to the global context.

[9] The English translations have been produced by Google Translate.
[10] In these examples, the system has predicted a boundary immediately before the sentences displayed.
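The transition-level scores discussed above are ordinary per-class precision, recall and F1 over predicted boundary labels, so a boundary that is misplaced by a single sentence, as in Example (1), counts as both a false positive and a false negative. The following sketch illustrates this behavior; it is our own reconstruction of the metric, not the official STSS scorer, and the function name and label encoding are assumptions.

```python
# Sketch of per-class precision/recall/F1 over boundary predictions.
# gold/pred hold one label per sentence position: a transition label for
# segment-initial sentences and None for non-boundaries.

def boundary_prf(gold, pred, label):
    """Precision, recall and F1 for one transition class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# A boundary predicted one sentence too early scores zero: one false
# positive (position 1) plus one false negative (position 2).
gold = [None, None, "Scene-Scene", None]
pred = [None, "Scene-Scene", None, None]
print(boundary_prf(gold, pred, "Scene-Scene"))  # -> (0.0, 0.0, 0.0)
```

This strict position-matching is what makes near-miss predictions, which a human reader might judge almost correct, indistinguishable from random errors in the official scores.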
6 Conclusion

The current paper summarizes our submission to the Shared Task on Scene Segmentation (STSS). We handle scene segmentation as a sequential sentence classification task and offer a BERT-based solution. The proposed model achieves the best performance in the in-domain evaluations but falls short of transferring its performance across domains. Error analysis further reveals that the predictions are more sensitive to local cues than to the global structure of the text, highlighting the need for better document-level modeling.

Acknowledgments

This work has been partly funded by an infrastructure grant from the Swedish Research Council (SWE-CLARIN, 2019–24; contract no. 2017-00626). We thank the Swedish National Infrastructure for Computing (SNIC) for providing computational resources under Project 2020/33-26.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796.

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Dan Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3693–3699, Hong Kong, China. Association for Computational Linguistics.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Jonathan Rotsztejn, Nora Hollenstein, and Ce Zhang. 2018. ETH-DS3Lab at SemEval-2018 Task 7: Effectively combining recurrent and convolutional neural networks for relation classification and extraction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 689–696.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long Range Arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.

Kisu Yang, Dongyub Lee, Taesun Whang, Seolhwa Lee, and Heuiseok Lim. 2019. EmotionX-KU: BERT-max based contextual emotion classifier. arXiv preprint arXiv:1906.11565.

Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, et al. 2021a. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3167–3177.

Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, and Annekea Schreiber. 2021b. Shared task on scene segmentation@KONVENS 2021. In Shared Task on Scene Segmentation.