Breaking the Narrative: Scene Segmentation through Sequential Sentence Classification

Murathan Kurfalı
murathan.kurfali@ling.su.se
Department of Linguistics, Stockholm University, Stockholm, Sweden

Mats Wirén
mats.wiren@ling.su.se
Department of Linguistics, Stockholm University, Stockholm, Sweden


In this paper, we describe our submission to the Shared Task on Scene Segmentation (STSS). The shared task requires participants to segment novels into coherent segments, called scenes. We approach this as a sequential sentence classification task and offer a BERT-based solution with a weighted cross-entropy loss. According to the official results, the proposed approach performs relatively well on the task: our model ranks first in the in-domain evaluation and second in the out-of-domain evaluation. However, the low overall performance (0.37 F1-score) suggests that there is still much room for improvement.

Introduction

Scene segmentation is a novel task introduced in Zehe et al. (2021a) that aims to divide long narrative texts, e.g. novels, into smaller coherent segments, called scenes. Scenes, in this context, can be roughly defined as "a segment of a text where the story time and the discourse time are more or less equal, the narration focuses on one action and space and character constellations stay the same" (Zehe et al., 2021a).1 The task of scene segmentation is valuable in several respects: (i) it can be directly employed in several digital humanities tasks, e.g. plot reconstruction; (ii) segmenting longer texts into smaller coherent pieces helps other NLP tasks, e.g. co-reference resolution, that struggle with texts longer than a couple of paragraphs (Joshi et al., 2020); (iii) as a novel task that requires high-level modeling of long texts, it offers itself as a valuable probing task for evaluating language models in long-context scenarios, which is an active research area (Tay et al., 2020).

1 Interested readers are referred to the annotation guidelines available at https://zenodo.org/record/4457177 for further details.

Our main interest in the current paper is to explore whether scene segmentation can be handled as a sequential sentence classification task. To this end, we follow the methodology proposed in Cohan et al. (2019), which encodes all sentences in a sequence jointly through BERT (Devlin et al., 2019) to directly leverage the contextual information from all tokens in the sequence at the same time. The model of Cohan et al. (2019) is further adapted to the task via introduction of a weighted cross-entropy loss in order to account for the imbalanced distribution of the labels in the dataset.

According to the official results, our model achieves the best performance on the in-domain texts, significantly outperforming the second-ranking system. However, the performance drops when evaluated on out-of-domain novels, suggesting that the proposed methodology generalizes poorly across domains. We release our system to facilitate reproducibility and future work.2

System Overview

Task Details

The scene segmentation task can be framed in several ways. Within the shared task, it is defined as the identification of the boundaries that delimit consecutive segments (Zehe et al., 2021b). The boundaries between segments are labeled according to the types of segments they delimit. Specifically, a boundary can belong to one of the following three classes: Scene-Scene, Nonscene-Scene, and Scene-Nonscene.3 The participating teams are evaluated only on their success at finding and labeling these boundaries. That is to say, the classification of an individual sentence as belonging to a Scene or a Nonscene plays very little role in the evaluation. Of the three possible transitions, Scene-Scene is the most common one, as Nonscenes are significantly less frequent in the data (see Table 1).
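To make the boundary definition concrete, the following minimal sketch derives the labeled boundaries from a sequence of annotated segments. The `(segment_type, n_sentences)` representation and the function name `boundary_labels` are assumptions made for illustration; they are not taken from the shared-task data format.

```python
from typing import List, Tuple

# The three transitions named in the task description.
VALID_TRANSITIONS = {
    ("Scene", "Scene"),
    ("Nonscene", "Scene"),
    ("Scene", "Nonscene"),
}


def boundary_labels(segments: List[Tuple[str, int]]) -> List[Tuple[int, str]]:
    """Return (sentence_index, transition) for each boundary.

    `segments` is a hypothetical representation: one (type, n_sentences)
    pair per segment, in document order. The returned index is the 0-based
    position of the first sentence of the segment that follows the boundary.
    """
    boundaries = []
    position = 0
    for (prev_type, prev_len), (next_type, _) in zip(segments, segments[1:]):
        position += prev_len
        if (prev_type, next_type) not in VALID_TRANSITIONS:
            raise ValueError(f"invalid transition: {prev_type}-{next_type}")
        boundaries.append((position, f"{prev_type}-{next_type}"))
    return boundaries


example = [("Scene", 3), ("Nonscene", 2), ("Scene", 4)]
print(boundary_labels(example))
```

Two adjacent Nonscene segments raise an error in this sketch, reflecting that Nonscene-Nonscene is not a valid transition in the task.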

Our Model

We model scene segmentation as a Sequential Sentence Classification (SSC) task where the goal is to determine whether a given sentence is segment-initial, along with the type of segment it belongs to. Similarly to the more common token classification tasks, e.g. POS-tagging or NER, we employ the IOB2 format and assign a tag to each sentence. Specifically, we label segment-initial sentences (boundaries) as #X-B and other sentences as merely #X, where X indicates the type of segment (Scene or Nonscene).
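The tagging scheme can be sketched as follows. This is an illustrative reading of the description above, not the released implementation: the exact label strings (`Scene-B`, `Scene`, and so on) and the `(segment_type, n_sentences)` input are assumptions.

```python
from typing import List, Tuple


def sentence_tags(segments: List[Tuple[str, int]]) -> List[str]:
    """Assign one tag per sentence.

    The first sentence of every segment gets `<type>-B`; every other
    sentence in that segment gets plain `<type>`. Label strings are
    hypothetical stand-ins for the paper's #X-B / #X notation.
    """
    tags = []
    for seg_type, n_sentences in segments:
        if n_sentences < 1:
            raise ValueError("a segment must contain at least one sentence")
        tags.append(f"{seg_type}-B")
        tags.extend([seg_type] * (n_sentences - 1))
    return tags


print(sentence_tags([("Scene", 3), ("Nonscene", 2)]))
```

With two segment types this yields four sentence-level classes, of which the two boundary classes are by construction much rarer than the others.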

Our classifier closely follows the methodology proposed in Cohan et al. (2019). Here, the authors employed BERT to perform several document-level classification tasks, e.g. abstract sentence classification, where the aim is to classify sentences in a scientific abstract into their rhetorical roles such as introduction, method, etc. The rest of this section describes the model along with our modifications.

The proposed methodology follows the standard way of using BERT through fine-tuning on the target task but uses a novel input representation. The classifier used in the experiments is illustrated in Figure 1. As input, a sequence of N sentences is concatenated by BERT's special delimiter token [SEP], yielding one long sequence. This sequence, after the insertion of the standard [CLS] token at the beginning, is fed into BERT. However, unlike the standard way of using the [CLS] token as the representation of the input sequence, the representations of the individual [SEP] tokens are used as the representations of the sentences that precede them. Hence, instead of the [CLS] token, [SEP] representations are classified by a multi-layer feedforward network to reach labels.
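The input construction described above can be illustrated with a small, library-free sketch. Whitespace splitting stands in for BERT's WordPiece tokenizer, and the "hidden states" are dummy vectors; both are simplifications assumed here purely for illustration, as are the function names.

```python
from typing import List, Sequence, Tuple


def build_input(sentences: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Concatenate sentences as [CLS] s1 [SEP] s2 [SEP] ... sN [SEP].

    Returns the token list and, for each sentence, the index of the
    [SEP] token that follows it.
    """
    tokens = ["[CLS]"]
    sep_positions = []
    for sentence in sentences:
        tokens.extend(sentence.split())  # stand-in for WordPiece tokenization
        sep_positions.append(len(tokens))
        tokens.append("[SEP]")
    return tokens, sep_positions


def sentence_representations(hidden_states, sep_positions):
    """Pick the encoder output at each [SEP] position (one per sentence)."""
    return [hidden_states[i] for i in sep_positions]


tokens, sep_positions = build_input(["Der Wagen fuhr .", "Dort lag das Schloss ."])
# Dummy one-dimensional "hidden states": position i is encoded as [float(i)].
fake_hidden = [[float(i)] for i in range(len(tokens))]
reps = sentence_representations(fake_hidden, sep_positions)
print(tokens)
print(sep_positions)
```

In the actual model, the vectors selected this way would be passed to the feedforward classifier, one prediction per sentence.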

The rationale for using [SEP] as sentence representation has to do with the next-sentence objective of BERT: "Intuitively, through BERT's pretraining, the [SEP] tokens learn sentence structure and relations between continuous sentences" (Cohan et al., 2019). During fine-tuning, the model is further primed to assign appropriate weights to [SEP] tokens to encode necessary contextual information for classification. Fine-tuning BERT in this way has the benefit of simultaneously leveraging the contextual information from all sentences in the sequence.

Loss function

The model is trained to minimize the cross-entropy loss between the probabilities over the possible labels, computed using a softmax activation, and the target distribution. However, during the initial experiments, we observed that the model suffered severely from the highly skewed label distribution, namely the low number of boundary sentences in comparison to non-boundary ones.4 In order to mitigate this issue, following previous studies (Rotsztejn et al., 2018; Cui et al., 2019; Yang et al., 2019), we introduce a weighting factor to the loss function, where each class is assigned a weight that is inversely proportional to its frequency in the training set:

$$\mathrm{weight}_c = \frac{\sum_i \mathrm{freq}(i)}{\mathrm{freq}(c)}$$

where freq indicates the count of a certain class.

Overall, the weighted cross-entropy becomes

$$\mathrm{Loss}_{WCE} = -\sum_{c}^{C} w_c \, t_c \log(s_c)$$

where w_c is the weight, t_c is the gold truth value (taking either 0 or 1), and s_c is the corresponding softmax probability of the class c.
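As a worked illustration of the two formulas, the sketch below computes the class weights from label counts and the weighted cross-entropy for a single sentence. It is written in plain Python for clarity; the function names and the toy label distribution are assumptions, not part of the released system.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def class_weights(labels: Sequence[str]) -> Dict[str, float]:
    """weight_c = (sum of all class counts) / (count of class c)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: total / count for label, count in counts.items()}


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def weighted_cross_entropy(logits, gold, classes, weights) -> float:
    """-sum_c w_c * t_c * log(s_c); with one-hot t, only the gold term remains."""
    probs = softmax(logits)
    gold_index = classes.index(gold)
    return -weights[gold] * math.log(probs[gold_index])


# Toy, made-up training labels: 8 non-boundary vs. 2 boundary sentences.
train_labels = ["Scene"] * 8 + ["Scene-B"] * 2
classes = ["Scene", "Scene-B"]
weights = class_weights(train_labels)
loss = weighted_cross_entropy([0.0, 0.0], "Scene-B", classes, weights)
print(weights, loss)
```

In this toy setting the rare boundary class receives a weight of 5.0 against 1.25 for the frequent class, so an error on a boundary sentence costs four times as much.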

Experimental Setup

Data

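The dataset used in this shared task is based on an expanded version of the annotation effort introduced in Zehe et al. (2021a), and consists of 20 German novels in total, excluding the blind test sets used in the official evaluations. During model development, we create custom development and test sets by randomly allocating one file for each, using the remaining 18 files for training.5 The statistics regarding the training/dev/test splits used during model development are provided in Table 1.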

Parameter Setting

We follow the implementation of Cohan et al. (2019).6 As the language model, we use the large German BERT model of Chan et al. (2020) (dubbed GBERT-large7) due to its superior performance over the existing German models. A batch size of 8 and 4 gradient accumulation steps are used to reach an effective batch size of 32. All experiments are run on a single V100 GPU. We set the learning rate to 5e-6 and train for a maximum of 100 epochs with early stopping (patience = 20) based on the performance on the development set. Due to BERT's inherent sequence length limit, we cap each sequence at 25 sentences; this threshold is chosen empirically (i.e., according to the performance on the in-house test set) from the set {20, 25, 30, 50}.
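The sentence cap and the batch arithmetic can be sketched as below. The paper does not state how a novel is split into sequences, so the non-overlapping, consecutive chunking shown here is an assumption, as are the constant and function names.

```python
from typing import List, Sequence

MAX_SENTENCES = 25      # sentence cap per sequence, as reported
BATCH_SIZE = 8          # per-step batch size, as reported
GRAD_ACCUM_STEPS = 4    # gradient accumulation steps, as reported
EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS


def chunk_sentences(sentences: Sequence[str],
                    max_sentences: int = MAX_SENTENCES) -> List[List[str]]:
    """Split a document into consecutive, non-overlapping sentence chunks.

    Assumption: the real system may chunk differently (e.g. with overlap).
    """
    if max_sentences < 1:
        raise ValueError("max_sentences must be positive")
    return [list(sentences[i:i + max_sentences])
            for i in range(0, len(sentences), max_sentences)]


novel = [f"Sentence {i}." for i in range(60)]
chunks = chunk_sentences(novel)
print([len(chunk) for chunk in chunks], EFFECTIVE_BATCH_SIZE)
```

One consequence of any such chunking is that a sentence near a chunk edge sees less context on one side, which may matter for boundaries that fall close to the cut.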

Results and Discussion

The official evaluation is performed on two different test sets:

i. Test suite 1 focuses on in-domain evaluation and consists of 5 annotated dime novels,

ii. Test suite 2 focuses on out-of-domain evaluation and consists of 2 annotated contemporary high-literature texts.

Table 2 presents the breakdown of our results by transition type, whereas the official ranking of the participating systems, according to the mean micro-averaged F1 scores, is provided in Table 3. According to the official rankings, the proposed approach is good at segmenting in-domain novels and outperforms the second-best system by some margin. However, the performance drops significantly when evaluated on out-of-domain novels, suggesting that the system generalizes poorly across domains.

According to Table 2, our model is best at recognizing Scene-Scene transitions; however, it is almost completely incapable of finding the borders between Nonscenes and Scenes. As the high-recall, low-precision scores suggest, our model tends to over-segment the novels. On average, the system produces 1.76 times more segments for the in-domain novels and 1.61 times more segments for the out-of-domain novels. This tendency towards over-segmentation hints at over-sensitivity to certain markers, which is further discussed in the next section.
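The relationship between over-segmentation and a high-recall, low-precision profile can be seen in a small sketch. It scores a predicted boundary as correct only if both its position and its type match a gold boundary exactly; this matching rule is an assumption for illustration and may differ from the official scorer, and the example boundaries are made up.

```python
from typing import Iterable, Tuple

Boundary = Tuple[int, str]  # (sentence index, transition type)


def boundary_prf(gold: Iterable[Boundary], predicted: Iterable[Boundary]):
    """Precision, recall and F1 under exact position-and-type matching."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def segment_ratio(gold: Iterable[Boundary], predicted: Iterable[Boundary]) -> float:
    """Predicted-to-gold segment count ratio (n boundaries give n + 1 segments)."""
    return (len(set(predicted)) + 1) / (len(set(gold)) + 1)


# Made-up example: every gold boundary is found, plus two spurious ones.
gold = [(10, "Scene-Scene"), (20, "Scene-Scene")]
pred = [(10, "Scene-Scene"), (15, "Scene-Scene"),
        (20, "Scene-Scene"), (30, "Scene-Scene")]
precision, recall, f1 = boundary_prf(gold, pred)
print(precision, recall, f1, segment_ratio(gold, pred))
```

Under exact matching, a boundary that is off by one sentence counts as both a false positive and a false negative, which is relevant to the near-miss cases discussed in the error analysis.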

Overall, the results clearly demonstrate that the task is extremely challenging even in the in-domain setting. The poor performance of solutions based on contextual embeddings8 highlights the need for novel architectures. One obvious drawback of BERT-based models is their inability to encode long sequences. Hence, a straightforward extension of the current model would be to employ a model which supports longer contexts, e.g. Longformer (Beltagy et al., 2020); however, such a model is unfortunately not available for German at the time of writing.

Error Analysis

In addition to the official evaluation, we performed a manual error analysis of our model's predictions on the in-house test set (see Section 3.1). One observation was that in certain cases, although the model correctly recognized the type of transition (e.g. Scene-Scene), it misplaced the boundary by only a single sentence. An instance of this can be seen in Example (1), where the predicted boundary appears just before sentence (1a), whereas the gold boundary appears just before the subsequent sentence (1b):9

(1) a. Und bald darauf fuhr der Wagen aus dem Wald und einen allmählich ansteigenden Berg hinan.

(And soon afterwards the car drove out of the forest and up a gradually rising mountain.)

b. Dort oben lag das Schloss Treuenfels.

(Treuenfels Castle was up there.)

Furthermore, as mentioned in the previous section, the system tends to over-segment the novels. A manual inspection of false positives (sentences that are erroneously identified as segment boundaries) reveals that, despite being incorrect, these predictions are not completely random. Most of the false positives involve an adverbial or other kind of phrase which signals a shift in time and/or place. Some cherry-picked examples are given below:

(5) Er wandte sich um und ging wieder zurück, bis in das Zimmer, wo der Schreibtisch der Gräfin Alice stand.

(He turned and went back to the room where Countess Alice's desk was.)

Similar to the behavior of the baseline system proposed in Zehe et al. (2021a), these examples highlight the model's sensitivity to local cues rather than the larger context. That is, to a certain extent, the system makes its predictions according to individual phrases that signal shifts in time or place, paying too little attention to the global context.

Conclusion

The current paper summarizes our submission to the Shared Task on Scene Segmentation (STSS). We handle scene segmentation as a sequential sentence classification task and offer a BERT-based solution. The proposed model achieves the best performance in the in-domain evaluation but fails to transfer this performance across domains. Error analysis further reveals that the predictions are more sensitive to local cues than to the global structure of the text, highlighting the need for better document-level modeling.

Figure 1: Overview of the system architecture. Each sentence is represented by the respective [SEP] token which is used to predict the label. Figure copied from Cohan et al. (2019).

Table 1: Characteristics of the train/dev/test splits used in model development. The numbers in the columns refer to the number of segments (Count), the average size of a segment in terms of number of sentences (Avg. Segment), and the average size of a sentence in terms of number of words (Avg. Sent), for Scenes and Non-scenes separately.

Split | Scene Count | Scene Avg. Segment | Scene Avg. Sent | Non-scene Count | Non-scene Avg. Segment | Non-scene Avg. Sent
Training | 1075 | 45.33 | 10.58 | 51 | 16.11 | 15.39
Dev | 127 | 69.5 | 12.12 | 7 | 5.618.20
Test | 46 | 38.16 | 8.05 | 7 | 19.14 | 11.15

Table 3: Official rankings and results of all participating systems, according to the micro-averaged F1 scores, averaged over all the novels in the corresponding suite. Results of our submission are highlighted in boldface.

2 https://github.com/MurathanKurfali/scene_segmentation

3 Unlike Scenes, Nonscenes, naturally, are not distinguished from one another; hence, Nonscene-Nonscene is not a valid transition.

4 The most frequent label, Scene, single-handedly accounts for 96.1% of the training data.

5 Files 9783740941093 and 9783732522033 are used as the dev and test set, respectively.

6 https://github.com/allenai/sequential_sentence_classification

7 https://huggingface.co/deepset/gbert-large

8 A BERT-based baseline in the original resource paper similarly fails on this task (Zehe et al., 2021a).

9 In these examples, the system has predicted the shift immediately before the sentences displayed.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Acknowledgments

This work has been partly funded by an infrastructure grant from the Swedish Research Council (SWE-CLARIN, 2019-24; contract no. 2017-00626). We thank the Swedish National Infrastructure for Computing (SNIC) for providing computational resources under Project 2020/33-26.

References

Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

Branden Chan, Stefan Schweter, and Timo Möller. 2020. German's next language model. In Proceedings of the 28th International Conference on Computational Linguistics.

Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, and Dan Weld. 2019. Pretrained language models for sequential sentence classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China. Association for Computational Linguistics.

Yin Cui, Menglin Jia, Tsung-Yi Lin, Yang Song, and Serge Belongie. 2019. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota. Association for Computational Linguistics.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8.

Jonathan Rotsztejn, Nora Hollenstein, and Ce Zhang. 2018. ETH-DS3Lab at SemEval-2018 Task 7: Effectively combining recurrent and convolutional neural networks for relation classification and extraction. In Proceedings of The 12th International Workshop on Semantic Evaluation.

Yi Tay, Mostafa Dehghani, Samira Abnar, Yikang Shen, Dara Bahri, Philip Pham, Jinfeng Rao, Liu Yang, Sebastian Ruder, and Donald Metzler. 2020. Long range arena: A benchmark for efficient transformers. arXiv preprint arXiv:2011.04006.

Kisu Yang, Dongyub Lee, Taesun Whang, Seolhwa Lee, and Heuiseok Lim. 2019. EmotionX-KU: BERT-Max based contextual emotion classifier. arXiv preprint arXiv:1906.11565.

Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, and Nils Reiter. 2021a. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume.

Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, and Annekea Schreiber. 2021b. Shared task on scene segmentation @ KONVENS 2021. In Shared Task on Scene Segmentation.