Breaking the Narrative: Scene Segmentation through Sequential Sentence Classification

Murathan Kurfalı                      Mats Wirén
Department of Linguistics             Department of Linguistics
Stockholm University                  Stockholm University
Stockholm, Sweden                     Stockholm, Sweden
murathan.kurfali@ling.su.se           mats.wiren@ling.su.se

Abstract

In this paper, we describe our submission to the Shared Task on Scene Segmentation (STSS). The shared task requires participants to segment novels into coherent segments, called scenes. We approach this as a sequential sentence classification task and offer a BERT-based solution with a weighted cross-entropy loss. According to the results, the proposed approach performs relatively well on the task, as our model ranks first and second in the official in-domain and out-of-domain evaluations, respectively. However, the overall low performance (0.37 F1-score) suggests that there is still much room for improvement.

1 Introduction

Scene segmentation is a novel task introduced in Zehe et al. (2021a) that aims to divide long narrative texts, e.g. novels, into smaller coherent segments, called scenes. Scenes, in this context, can be roughly defined as "a segment of a text where the story time and the discourse time are more or less equal, the narration focuses on one action and space and character constellations stay the same" (Zehe et al., 2021a).[1] The task of scene segmentation is of great value on several ends: (i) it can be directly employed in several digital humanities tasks, e.g. plot reconstruction; (ii) segmenting longer texts into smaller coherent pieces helps other NLP tasks, e.g. coreference resolution, that struggle with texts longer than a couple of paragraphs (Joshi et al., 2020); (iii) as a novel task that requires high-level modeling of long texts, it offers itself as a valuable probing task for evaluating language models in long-context scenarios, which is an active research area (Tay et al., 2020).

Our main interest in the current paper is to explore whether scene segmentation can be handled as a sequential sentence classification task. To this end, we follow the methodology proposed in Cohan et al. (2019), which encodes all sentences in a sequence jointly through BERT (Devlin et al., 2019) to directly leverage the contextual information from all tokens in the sequence at the same time. The model of Cohan et al. (2019) is further adapted to the task via the introduction of a weighted cross-entropy loss in order to account for the imbalanced distribution of the labels in the dataset.

According to the official results, our model achieves the best performance on the in-domain texts, significantly outperforming the second-ranking system. However, the performance drops when evaluated on out-of-domain novels, suggesting that the proposed methodology generalizes only poorly across domains. We release our system to facilitate reproducibility and future work.[2]

2 System Overview

2.1 Task Details

The scene segmentation task can be framed in several ways. Within the shared task, it is defined as the identification of the boundaries that delimit the consecutive segments (Zehe et al., 2021b). The boundaries between segments are labeled according to the types of segments they delimit. Specifically, a boundary can belong to one of the following three classes: Scene-Scene, Nonscene-Scene, and Scene-Nonscene.[3] The participating teams are evaluated only according to their success at finding and labeling these boundaries.

[1] Interested readers are referred to the annotation guidelines available at https://zenodo.org/record/4457177 for further details.
[2] https://github.com/MurathanKurfali/scene_segmentation
[3] Unlike Scenes, Nonscenes are naturally not distinguished from one another; hence, Nonscene-Nonscene is not a valid transition.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
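To make the label space concrete, the mapping from a gold segmentation to the three boundary classes can be sketched as follows. This is our own illustration, not the official shared-task code; the `(segment_id, segment_type)` encoding and the function name are assumptions made for the example.

```python
# Sketch: derive boundary (transition) labels from a gold segmentation.
# Each sentence is annotated with the segment it belongs to, encoded here
# as a (segment_id, segment_type) pair (our own illustrative encoding).

def transition_labels(sentences):
    """For each pair of adjacent sentences, emit the transition type when a
    new segment starts, and None when both belong to the same segment."""
    labels = []
    for (prev_id, prev_type), (curr_id, curr_type) in zip(sentences, sentences[1:]):
        if prev_id == curr_id:
            labels.append(None)  # not a boundary
        else:
            # One of "Scene-Scene", "Scene-Nonscene", "Nonscene-Scene"
            labels.append(f"{prev_type}-{curr_type}")
    return labels

# Two scenes separated by a short non-scene passage:
gold = [(0, "Scene"), (0, "Scene"),
        (1, "Nonscene"),
        (2, "Scene"), (2, "Scene")]
print(transition_labels(gold))
# -> [None, 'Scene-Nonscene', 'Nonscene-Scene', None]
```

Note that Nonscene-Nonscene never occurs under this scheme, since adjacent non-scene material belongs to a single Nonscene segment.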
That is to say, the classification of an individual sentence as belonging to a Scene or a Nonscene means very little in the evaluations. Of the three possible transitions, Scene-Scene is the most common one, as Nonscenes are significantly less frequent in the data (see Table 1).

2.2 Our Model

We model scene segmentation as a Sequential Sentence Classification (SSC) task, where the goal is to determine whether a given sentence is segment-initial or not, along with the type of segment it belongs to. Similarly to the more common token classification tasks, e.g. POS tagging or NER, we employ the IOB2 format and assign a tag to each sentence. Specifically, we label segment-initial sentences (boundaries) as #X-B and all other sentences merely as #X, where X indicates the type of segment (Scene or Nonscene).

Our classifier closely follows the methodology proposed in Cohan et al. (2019). Here, the authors employed BERT to perform several document-level classification tasks, e.g. abstract sentence classification, where the aim is to classify the sentences of a scientific abstract into their rhetorical roles, such as introduction, method, etc. The rest of this section describes the model along with our modifications.

The proposed methodology follows the standard way of using BERT through fine-tuning on the target task but uses a novel input representation. The classifier used in the experiments is illustrated in Figure 1. As input, a sequence of N sentences is concatenated by BERT's special delimiter token [SEP], yielding one long sequence. This sequence, after the insertion of the standard [CLS] token at the beginning, is fed into BERT. However, unlike the standard way of using the [CLS] token as the representation of the input sequence, the representations of the individual [SEP] tokens are used as the representations of the sentences that precede them. Hence, instead of the [CLS] token, the [SEP] representations are classified by a multi-layer feedforward network to obtain the labels. The rationale for using [SEP] as the sentence representation has to do with the next-sentence objective of BERT: "Intuitively, through BERT's pretraining, the [SEP] tokens learn sentence structure and relations between continuous sentences" (Cohan et al., 2019). During fine-tuning, the model is further primed to assign appropriate weights to the [SEP] tokens to encode the contextual information necessary for classification. Fine-tuning BERT in this way has the benefit of simultaneously leveraging the contextual information from all sentences in the sequence.

Figure 1: Overview of the system architecture. Each sentence is represented by the respective [SEP] token, which is used to predict the label. Figure copied from Cohan et al. (2019).

Loss function. The model is trained to minimize the cross-entropy loss between the probabilities over the possible labels, computed using a softmax activation, and the target distribution. However, during the initial experiments, we observed that the model severely suffered from the highly skewed label distribution, namely the low number of boundary sentences in comparison to non-boundary ones.[4] In order to mitigate this issue, following previous studies (Rotsztejn et al., 2018; Cui et al., 2019; Yang et al., 2019), we introduce a weighting factor into the loss function, where each class is assigned a weight that is inversely proportional to its frequency in the training set:

    weight_c = ( Σ_i freq(i) ) / freq(c)

where freq indicates the count of a certain class. Overall, the weighted cross-entropy becomes

    Loss_WCE = − Σ_{c=1}^{C} w_c · t_c · log(s_c)

where w_c is the weight, t_c is the gold target value (taking either 0 or 1), and s_c is the corresponding softmax probability of class c.

[4] The most frequent label, Scene, single-handedly accounts for 96.1% of the training data.

3 Experimental Setup

3.1 Data

The dataset used in this shared task is based on an expanded version of the annotation effort introduced in Zehe et al. (2021a) and consists of 20 German novels in total, excluding the blind test sets used in the official evaluations. During model development, we create custom development and test sets by randomly allocating one file for each, using the remaining 18 files for training.[5] The statistics regarding the training/dev/test splits used during model development are provided in Table 1.

Split      | Scene                                   | Non-scene
           | Count  |Avg. Segment|  |Avg. Sent|      | Count  |Avg. Segment|  |Avg. Sent|
Training   | 1075   45.33          10.58            | 51     16.11          15.39
Dev        | 127    69.5           12.12            | 7      5.6            18.20
Test       | 46     38.16          8.05             | 7      19.14          11.15

Table 1: Characteristics of the train/dev/test splits used in model development. The columns give the number of segments, the average size of a segment (in number of sentences), and the average length of a sentence (in number of words), for Scenes and Non-scenes separately.

3.2 Parameter Setting

We follow the implementation of Cohan et al. (2019).[6] As the language model, we use the large German BERT model of Chan et al. (2020) (dubbed GBERT-large[7]) due to its superior performance over the existing German models. A batch size of 8 and gradient accumulation over 4 steps are used to reach an effective batch size of 32. All experiments are run on a single V100 GPU. We set the learning rate to 5e-6, and training is run for a maximum of 100 epochs with early stopping (patience = 20) based on the performance on the development set. Due to BERT's inherent sequence length limit, we set a threshold of 25 sentences per sequence, chosen empirically (i.e., according to the performance on the in-house test set) from the set {20, 25, 30, 50}.

[5] Files 9783740941093 and 9783732522033 are used as the dev and test set, respectively.
[6] https://github.com/allenai/sequential_sentence_classification
[7] https://huggingface.co/deepset/gbert-large

4 Results and Discussion

The official evaluation is performed on two different test sets:

i. Test suite 1 focuses on in-domain evaluation and consists of 5 annotated dime novels.
ii. Test suite 2 focuses on out-of-domain evaluation and consists of 2 annotated contemporary high-literature texts.

Table 2 presents the breakdown of our results for each possible transition, whereas the official ranking of the participating systems, according to the mean micro-averaged F1 scores, is provided in Table 3. According to the official rankings, the proposed approach is good at segmenting in-domain novels and outperforms the second-best system by some margin. However, the performance drops significantly when evaluated on out-of-domain novels, suggesting that the system generalizes poorly across domains.

According to Table 2, our model is best at recognizing Scene-to-Scene transitions; however, it is almost completely incapable of finding the borders between non-scenes and scenes. As suggested by the high-recall, low-precision scores, our model tends to over-segment the novels. On average, the system divides the in-domain novels into 1.76 and the out-of-domain novels into 1.61 times more segments than the gold standard. This tendency towards over-segmentation hints at an over-sensitivity to certain markers, which is further discussed in the next section.

Overall, the results clearly demonstrate that the task is extremely challenging even in the in-domain setting. The poor performance of solutions based on contextual embeddings[8] highlights the need for novel architectures. One obvious drawback of BERT-based models is their inability to encode long sequences. Hence, a straightforward extension of the current model would be to employ a model which supports longer contexts, e.g. Longformer (Beltagy et al., 2020); however, such a model is unfortunately not available for German at the time of writing.

[8] A BERT-based baseline in the original resource paper similarly fails on this task (Zehe et al., 2021a).

                    | In-domain              | Out-of-domain
                    | Prec.  Rec.  F1-score  | Prec.  Rec.  F1-score
Scene-to-Scene      | 0.31   0.64  0.42      | 0.14   0.26  0.19
Scene-to-Nonscene   | 0.08   0.06  0.07      | 0.00   0.00  0.00
Nonscene-to-Scene   | 0.00   0.00  0.00      | 0.00   0.00  0.00
Micro average       | 0.29   0.51  0.37      | 0.14   0.22  0.17
Macro average       | 0.13   0.23  0.16      | 0.05   0.09  0.06
Weighted average    | 0.25   0.51  0.33      | 0.12   0.22  0.16

Table 2: Official results of our submission (precision, recall and F1-score) for each type of transition, along with the averaged results.

Rank  Track 1  Track 2
1.    0.37*    0.26
2.    0.16     0.17*
3.    0.07     0.12
4.    0.02     0.11
5.    -        0.04

Table 3: Official rankings and results of all participating systems, according to the micro-averaged F1 scores, averaged over all the novels in the corresponding suite. The results of our submission are marked with an asterisk.

5 Error Analysis

In addition to the official evaluation, we performed a manual error analysis of our model's predictions on the in-house test set (see Section 3.1). One observation was that in certain cases, although the model correctly recognized the type of transition (e.g. Scene-Scene), it misplaced the boundary by a single sentence. An instance of this can be seen in Example (1), where the predicted boundary appears just before sentence (1a), whereas the gold boundary appears just before the subsequent sentence (1b):[9]

(1) a. Und bald darauf fuhr der Wagen aus dem Wald und einen allmählich ansteigenden Berg hinan.
       (And soon afterwards the car drove out of the forest and up a gradually rising mountain.)
    b. Dort oben lag das Schloss Treuenfels.
       (Treuenfels Castle was up there.)

Furthermore, as mentioned in the previous section, the system tends to over-segment the novels. A manual inspection of the false positives (sentences that are erroneously identified as segment boundaries) reveals that, despite being incorrect, these predictions are not completely random. Most of the false positives involve an adverbial or another kind of phrase which signals a shift in time and/or place. Some cherry-picked examples are given in Examples (2)-(5):[10]

(2) Als das Gefährt das Bergplateau erreicht hatte, ließ der Fahrer einige Male laut die Hupe ertönen.
    (When the vehicle had reached the mountain plateau, the driver sounded the horn loudly a few times.)

(3) Eines Abends, als Graf Harro von einer Herrengesellschaft zeitiger nach Hause kam, als man erwartete, fand er seine Gattin in einer sehr zärtlichen Stellung mit dem jungen Prinzen.
    (One evening, when Count Harro came home from a gentlemen's party earlier than expected, he found his wife in a very affectionate position with the young prince.)

(4) Und am nächsten Morgen fand man die Gräfin Alice tot auf ihrem Lager.
    (And the next morning Countess Alice was found dead in her bed.)

(5) Er wandte sich um und ging wieder zurück, bis in das Zimmer, wo der Schreibtisch der Gräfin Alice stand.
    (He turned around and went back to the room where Countess Alice's desk stood.)

Similar to the behavior of the baseline system proposed in Zehe et al. (2021a), these examples highlight the model's sensitivity to local cues rather than the larger context. That is, to a certain extent, the system makes its predictions according to individual phrases that signal shifts in time or place, paying too little attention to the global context.

[9] The English translations have been produced by Google Translate.
[10] In these examples, the system has predicted a boundary immediately before the sentences displayed.
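The transition-level scores discussed above are ordinary per-class precision, recall and F1 over predicted boundary labels, so a boundary that is misplaced by a single sentence, as in Example (1), counts as both a false positive and a false negative. The following sketch illustrates this behavior; it is our own reconstruction of the metric, not the official STSS scorer, and the function name and label encoding are assumptions.

```python
# Sketch of per-class precision/recall/F1 over boundary predictions.
# gold/pred hold one label per sentence position: a transition label for
# segment-initial sentences and None for non-boundaries.

def boundary_prf(gold, pred, label):
    """Precision, recall and F1 for one transition class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# A boundary predicted one sentence too early scores zero: one false
# positive (position 1) plus one false negative (position 2).
gold = [None, None, "Scene-Scene", None]
pred = [None, "Scene-Scene", None, None]
print(boundary_prf(gold, pred, "Scene-Scene"))  # -> (0.0, 0.0, 0.0)
```

This strict position-matching is what makes near-miss predictions, which a human reader might judge almost correct, indistinguishable from random errors in the official scores.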
6 Conclusion

The current paper summarizes our submission to the Shared Task on Scene Segmentation (STSS). We handle scene segmentation as a sequential sentence classification task and offer a BERT-based solution. The proposed model achieves the best performance in the in-domain evaluations but falls short of transferring its performance across domains. Error analysis further reveals that the predictions are more sensitive to local cues than to the global structure of the text, highlighting the need for better document-level modeling.

Acknowledgments

This work has been partly funded by an infrastructure grant from the Swedish Research Council (SWE-CLARIN, 2019–24; contract no. 2017-00626). We thank the Swedish National Infrastructure for Computing (SNIC) for providing computational resources under Project 2020/33-26.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6788–6796.

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Dan Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3693–3699, Hong Kong, China. Association for Computational Linguistics.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9268–9277.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Jonathan Rotsztejn, Nora Hollenstein, and Ce Zhang. 2018. ETH-DS3Lab at SemEval-2018 Task 7: Effectively combining recurrent and convolutional neural networks for relation classification and extraction. In Proceedings of The 12th International Workshop on Semantic Evaluation, pages 689–696.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long Range Arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.

Kisu Yang, Dongyub Lee, Taesun Whang, Seolhwa Lee, and Heuiseok Lim. 2019. EmotionX-KU: BERT-max based contextual emotion classifier. arXiv preprint arXiv:1906.11565.

Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, et al. 2021a. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3167–3177.

Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, and Annekea Schreiber. 2021b. Shared task on scene segmentation@KONVENS 2021. In Shared Task on Scene Segmentation.