<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Albin Zehe</string-name>
          <email>zehe@informatik.uni-wuerzburg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evelyn Gius</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Equal Contribution</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leonard Konle</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>TU Darmstadt</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Cologne</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>This paper describes the Shared Task on Scene Segmentation (STSS) at KONVENS 2021: The goal is to provide a model that can accurately segment literary narrative texts into scenes and non-scenes. To this end, participants were provided with a set of 20 contemporary dime novels annotated with scene information as training data. The evaluation of the task is split into two tracks: The test set for Track 1 consists of 4 in-domain texts (dime novels), while Track 2 tests the generalisation capabilities of the models on 2 out-of-domain texts (highbrow literature from the 19th century). 5 teams participated in the task and submitted a model for final evaluation as well as a system description paper, with the best-performing models reaching F1-scores of 37 % for Track 1 and 26 % for Track 2. The results show that the task of scene segmentation is very challenging, but also suggest that it is feasible in principle. A detailed evaluation of the predictions reveals that the best-performing model is able to pick up many signals for scene changes, but struggles with the level of granularity that actually constitutes a scene change.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The objective of this shared task is to develop a model capable of solving the task of scene segmentation, as discussed by Gius et al. (2019) and formally introduced by Zehe et al. (2021). According to their definition, a scene can be understood as “a segment of a text where the story time and the discourse time are more or less equal, the narration focuses on one action and space and character constellations stay the same”. The task of scene segmentation is therefore a kind of text segmentation task applicable specifically to narrative texts (e.g., novels or biographies): These texts can be seen as a sequence of segments, where some of the segments are scenes and some are non-scenes. The goal of scene segmentation is to provide both the borders of the segments as well as the classification of each segment as a scene or non-scene. Solving this task advances the field of computational literary studies: the texts of interest in this field are often very long and can therefore not easily be processed with NLP methods. Breaking them down into narratologically motivated units of meaning like scenes would enable processing these units (semi-)individually and then aggregating the results over the entire text. In addition, a segmentation into scenes allows plot- and content-based analyses of the texts. The task website is available at https://go.uniwue.de/stss2021.</p>
    </sec>
    <sec id="sec-2">
      <title>Background: Scene Segmentation</title>
      <p>This section provides an overview of the task of
scene segmentation according to the definition by
Zehe et al. (2021). For the full motivation and
description, we refer to this paper.</p>
      <p>From a narratological point of view, a scene can
be defined by reference to a set of four
dimensions: time, space, action and character constellation.
Using these dimensions, a scene is a segment of the
discours (presentation) of a narrative which
presents a part of the histoire (connected events in the
narrated world) such that (1) time is equal in
discours and histoire, (2) place stays the same, (3) it
centers around a particular action, and (4) the
character constellation is equal. None of these conditions is absolute; rather, they are relative, that is, small changes in any of them do not necessarily lead to a scene change but can rather be seen as indicators.</p>
      <p>Casting this definition as a machine learning task,
we receive as input a (narrative) text and want to
develop a model that (a) splits the text into a
sequence of segments and (b) labels each of these
segments either as a scene or as a non-scene.
Depending on how the changes are realised, boundaries can be strong or weak. Segments separated by a weak boundary can be aggregated into one segment, while segments separated by a strong boundary need to be considered separately.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>The related work for scene segmentation has been
covered in much detail by Zehe et al. (2021). For
completeness, we reproduce their overview here
with only minor adaptation:</p>
      <p>
        Segmentation tasks have been discussed in NLP
for a while, mostly with the goal of identifying
regions of news or other non-fictional texts discussing
certain topics. The task of topic segmentation is
then to identify points in the text where the topic
under discussion changes. Early work to this end
uses similarity of adjacent text segments (such as
sentences or paragraphs) with a manually designed
similarity metric in order to produce the resulting
segments. One of the best-known systems of this kind is TextTiling
        <xref ref-type="bibr" rid="ref11">(Hearst, 1997)</xref>
        , which
was applied to science magazines. Similarity
based on common words
        <xref ref-type="bibr" rid="ref2 ref5">(Choi, 2000; Beeferman
et al., 1999)</xref>
was superseded by the introduction of Latent Dirichlet Allocation
        <xref ref-type="bibr" rid="ref3">(Blei et al., 2003)</xref>
, which made it possible to segment texts into coherent snippets with similar topic distributions
        <xref ref-type="bibr" rid="ref24 ref31">(Riedl and
Biemann, 2012; Misra et al., 2011)</xref>
        . This procedure
was extended by the integration of entity coherence
        <xref ref-type="bibr" rid="ref12">(John et al., 2016)</xref>
        and Wanzare et al. (2019) have
used it on (very short) narrative texts in an attempt
to extract scripts. Recently, many approaches
making use of neural architectures deal with the
detection and classification of local coherence
        <xref ref-type="bibr" rid="ref12 ref12 ref14 ref19 ref20 ref20 ref28 ref28">(e. g. Li
and Jurafsky, 2016; Pichotta and Mooney, 2016; Li
and Hovy, 2014)</xref>
, which is an important step towards high-quality text summarization
        <xref ref-type="bibr" rid="ref34">(Xu et al., 2019)</xref>
        .
Neural text segmentation has also been applied to Chinese texts, where recurrent neural networks were shown to predict the coherence of subsequent paragraphs with an accuracy of more than 80 %
        <xref ref-type="bibr" rid="ref25">(Pang et al., 2019)</xref>
. Lukasik et al. (2020) compare three BERT-based architectures for segmentation tasks: a cross-segment BERT following the next-sentence-prediction (NSP) pre-training task and fine-tuned on segmentation, a Bi-LSTM on top of BERT to keep track of larger context, and an adaptation of a hierarchical BERT network
        <xref ref-type="bibr" rid="ref36">(Zhang et al., 2019)</xref>
        .
      </p>
      <p>
        Some work has been done on segmenting
narrative texts, but aiming at identifying topical
segments – which, as we have pointed out above, is
different from scene segmentation. With a set of
hand-crafted features, Kauchak and Chen (2005)
achieve a WindowDiff score
        <xref ref-type="bibr" rid="ref27 ref4">(Pevzner and Hearst,
2002)</xref>
        of about 0.5, evaluated on two novels.
Kazantseva and Szpakowicz (2014) have annotated
the novel Moonstone with topical segments, and
presented a model to create a hierarchy of topic
segments. They report about 0.3 WindowDiff
score. Recently, Pethe et al. (2020) have introduced
the task of chapter segmentation, which is similar
to scene segmentation in that they both focus on
narrative texts. However, it aims at detecting
chapters, which are based on structural information like
headers, whereas scenes are defined by features of
the told story not directly connected to structural
information. Notably, our dataset contains some
scenes that cross chapter boundaries, since our
characteristics of scenes are entirely independent of
such formal markers. Most closely related to our
task are the papers by Reiter (2015), who
documents a number of annotation experiments, and
Kozima and Furugori (1994), who present lexical
cohesiveness based on the semantic network
Paradigme
        <xref ref-type="bibr" rid="ref15">(Kozima and Furugori, 1993)</xref>
        as an indicator
for scene boundaries and evaluates their approach
qualitatively on a single novel. However, neither of
them provide annotation guidelines, annotated data
or a formal definition of the task.
      </p>
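      <p>For reference, the WindowDiff measure used in several of the studies above can be sketched in a few lines (our own minimal illustration, not the evaluation code of any of the cited works): a window of fixed size slides over the text, and the score is the fraction of window positions where reference and hypothesis disagree on the number of boundaries they contain.</p>

```python
def window_diff(ref, hyp, k):
    """WindowDiff (Pevzner and Hearst, 2002): ref and hyp are lists of
    boundary indicators (1 = a segment ends after this position).
    k is typically set to half the average reference segment length.
    Lower is better; 0.0 means the segmentations agree everywhere."""
    assert len(ref) == len(hyp)
    n = len(ref)
    errors = sum(
        1
        for i in range(n - k)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / (n - k)

ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 1]
hyp = [0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
print(window_diff(ref, hyp, k=3))  # one window position in seven disagrees
```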
      <p>
        A related area of research is discourse
segmentation, where the goal is also to find segments that are
not necessarily defined by topic, and are also
assigned labels in addition to the segmentation. There
are annotated news corpora in this area featuring
fine-grained discourse relations between relatively
small text spans
        <xref ref-type="bibr" rid="ref29 ref4">(Carlson et al., 2002; Prasad et al.,
2008)</xref>
        . Although larger structures have been
discussed in literature
        <xref ref-type="bibr" rid="ref21">(Grosz and Sidner, 1986)</xref>
        , no
annotated corpora have been released.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Shared Task on Scene Segmentation (STSS)</title>
      <p>The Shared Task on Scene Segmentation was organised as one of the shared tasks of KONVENS 2021 (https://konvens2021.phil.hhu.de/). There were a total of 8 registrations, out of which 5 teams submitted a model for the final evaluation as well as a system description paper. The task was split into two tracks, with the first one evaluating on in-domain data and the second one on out-of-domain data. The test data was kept back for the entire duration of the task, and trained models were submitted to the organisers as Docker images for the final evaluation.</p>
      <sec id="sec-5-1">
        <title>Data</title>
        <p>Trial Data A single text, “Der kleine
Chinesengott” by Pitt Strong (aka Elisabeth von Aspern)
was released as trial data before the actual training
set, in order to show the format of the dataset and
enable participants to start working on their
implementation as soon as possible.</p>
        <p>Training Data The training data consisted of
20 annotated dime novels, which include the
15 texts from Zehe et al. (2021) as well as 5 new
texts that were annotated according to the same
guidelines. The texts are given in the appendix in
Table 3, along with detailed dataset statistics
(Table 4). Since the texts are protected by copyright,
they could not be distributed directly. Instead,
participants were asked to register with a German ebook shop and received the books as a gift on this website, along with standoff annotations and a script to merge the epub files with the annotations.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation Data</title>
        <p>Track 1 The first subset of the evaluation
data, used in Track 1 of the shared task, consisted
of 4 texts from the same domain as the training
set, that is, dime novels. Detailed statistics for this
dataset are available in Table 5.</p>
        <p>Track 2 The second evaluation set, used for
Track 2, consisted of out-of-domain data,
specifically 2 high-brow literature novels. This set, presented
in detail in Table 6, was chosen to investigate how
well the submitted approaches were able to deal
with texts that are assumed to differ strongly from
the training data in writing style.
</p>
      </sec>
      <sec id="sec-5-3">
        <title>Evaluation Metrics</title>
        <p>
          Evaluating scene segmentation is a somewhat
challenging problem in itself. Zehe et al. (2021) use
two evaluation metrics, F1-score and Mathet’s γ
          <xref ref-type="bibr" rid="ref23">(Mathet et al., 2015)</xref>
          , arguing that γ is the more
suitable measure for scene segmentation: F1-score
only counts a scene boundary as correct if it is
predicted at exactly the right position, while an
offset of one sentence would already count as a
complete miss. On the other hand, γ tries to align
the predicted boundaries with the gold boundaries
and score both the fit of the alignment as well
as the classification into scene and non-scene.
However, since the γ measure itself requires the
user to specify certain parameters and it is not
immediately obvious how to set these parameters
in our context, the main evaluation in this shared
task is based on the exact F1-score. More precisely,
we represent the segmentation produced by each
system as a list of boundary predictions: Each
sentence in the text is labelled as either NOBORDER,
SCENE-TO-SCENE, SCENE-TO-NONSCENE
or NONSCENE-TO-SCENE. For example, a
sentence that starts a new scene after a segment
that is classified as a non-scene would be labelled
as NONSCENE-TO-SCENE. This classification
can then directly be compared to the gold standard
annotations.
        </p>
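        <p>To illustrate, deriving these per-sentence labels from a labelled segmentation might look as follows (a sketch under our own segment representation, not the official task format):</p>

```python
def boundary_labels(segments):
    """segments: list of (n_sentences, is_scene) pairs in text order.
    Returns one boundary label per sentence, as in the main evaluation."""
    labels = []
    prev_is_scene = None
    for n_sentences, is_scene in segments:
        if prev_is_scene is None:
            first = "NOBORDER"  # the document start is not a boundary
        elif prev_is_scene and is_scene:
            first = "SCENE-TO-SCENE"
        elif prev_is_scene and not is_scene:
            first = "SCENE-TO-NONSCENE"
        elif not prev_is_scene and is_scene:
            first = "NONSCENE-TO-SCENE"
        else:
            raise ValueError("adjacent non-scenes are assumed to be merged")
        labels.append(first)
        labels.extend(["NOBORDER"] * (n_sentences - 1))
        prev_is_scene = is_scene
    return labels

# a scene of 3 sentences, a non-scene of 2, then a scene of 2 sentences
print(boundary_labels([(3, True), (2, False), (2, True)]))
```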
        <p>The classes in this scheme are highly
imbalanced, with NOBORDER making up the vast
majority of the labels. Therefore, for our main
evaluation, we exclude the label NOBORDER and build
micro-averaged scores between the other classes.
We chose micro-averaging despite the class imbalance since the minority classes are not more important to the classification; micro-averaged scores therefore give a better representation of the overall classification performance.</p>
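        <p>This micro-averaged scoring can be sketched as follows (our own illustration; the official scoring script may differ in details such as the treatment of mislabelled boundaries):</p>

```python
def micro_f1(gold, pred, ignore="NOBORDER"):
    """Micro-averaged precision/recall/F1 over boundary labels,
    ignoring the majority NOBORDER class."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == ignore and p == ignore:
            continue  # true negatives are not counted
        if g == p:
            tp += 1
        else:
            if p != ignore:
                fp += 1  # predicted a spurious or mislabelled boundary
            if g != ignore:
                fn += 1  # missed or mislabelled a gold boundary
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["NOBORDER", "NONSCENE-TO-SCENE", "NOBORDER", "SCENE-TO-NONSCENE"]
pred = ["NOBORDER", "NONSCENE-TO-SCENE", "SCENE-TO-SCENE", "NOBORDER"]
print(micro_f1(gold, pred))  # (0.5, 0.5, 0.5)
```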
        <p>For reference, we also report the γ scores of the approaches.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Submitted Systems</title>
        <p>This section provides an overview of the
approaches to scene segmentation submitted by the
participants of the Shared Task.</p>
        <p>Kurfali and Wirén (2021) apply the sequential
sentence classification system proposed by Cohan
et al. (2019) to the scene-segmentation task. This
system is based on BERT, but uses a customised
input format, where each sentence of the input
sequence is separated by BERT’s special token
“[SEP]”. After passing a sequence through BERT,
the output of those “[SEP]” tokens is fed into a
multi-layer perceptron to predict a label for its
preceding sentence. While the original system
utilises a mean-squared-error loss, Kurfali and Wirén (2021) implement weighted cross-entropy to deal
with the class imbalance in the scene dataset and
make use of the IOB2 scheme instead of simple
classification with categories.</p>
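        <p>The IOB2 reformulation can be illustrated as follows (a sketch; the exact tag inventory used in the submitted system is an assumption on our part): each sentence receives a B- tag if it opens a segment and an I- tag otherwise, with the segment type as suffix.</p>

```python
def to_iob2(segments):
    """Turn (n_sentences, segment_type) pairs into IOB2 tags,
    one per sentence (e.g. B-SCENE, I-SCENE, B-NONSCENE, ...)."""
    tags = []
    for n_sentences, seg_type in segments:
        tags.append(f"B-{seg_type}")               # segment-opening sentence
        tags.extend([f"I-{seg_type}"] * (n_sentences - 1))  # continuation
    return tags

print(to_iob2([(2, "SCENE"), (1, "NONSCENE"), (2, "SCENE")]))
# ['B-SCENE', 'I-SCENE', 'B-NONSCENE', 'B-SCENE', 'I-SCENE']
```

<p>Compared to directly predicting the boundary classes, such a tagging scheme lets every sentence carry a label, which can make the class distribution less extreme.</p>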
        <p>The system submitted by Gombert (2021) builds on the idea of using sentences that function as scene borders as feature vectors for the prediction of scene borders. For this purpose, a sentence embedding space is first learned in a twin BERT training setup, in which the model separates sentences functioning as scene borders from sentences within scenes. In a second step, a gradient-boosted decision tree ensemble is fed with feature vectors from the sentence embeddings generated by the model.</p>
        <p>The system submitted by Barth and Dönicke (2021) focuses on the manual design of vectors covering different sets of features for scene segmentation. The first set consists of general linguistic features like tense, POS tags, etc. The other sets focus on features crucial for the scene segmentation task, explicitly encoding temporal expressions as well as entity mentions. These feature vectors are then used as input to a random forest classifier.</p>
        <p>The system of Hatzel and Biemann (2021) casts the problem of scene segmentation as a kind of next-sentence prediction: It focuses on the “[SEP]” tokens which appear in between two subsequent sentences in the input representation for a BERT model, and uses their embedding representation from a German BERT model. In addition to the BERT embeddings, the authors add manual features capturing changes in the character constellation that are derived from a German adaptation of the coarse-to-fine co-reference architecture (Lee et al., 2018). This final representation is fed into a fully connected layer with a softmax activation function in order to detect scene changes. Since this approach predicts too many scenes in close proximity, they evaluate different ways to suppress neighbouring scenes for their final prediction. Specifically, they use a cost function which punishes very short scenes harshly.</p>
      </sec>
      <sec id="sec-5-5">
        <title>Evaluation of the Automatic Systems for Scene Segmentation</title>
        <p>In the following, we present and discuss the performance of the submitted systems in our shared task. All results are summarised in Table 1.</p>
        <p>The most successful system on Track 1 was the one proposed by Kurfali and Wirén (2021), reaching an F1-score of 37 % on the evaluation set for Track 1. For Track 2, their model was somewhat less successful, reaching an F1-score of 17 %, which still corresponds to the second place. On Track 2, the system proposed by Gombert (2021) performed best, with an F1-score of 26 % (16 % on Track 1). All results for both systems, with evaluation for all border classes on individual texts, can be found in the appendix in Tables 7 and 8. Overall, these results show that scene segmentation is a very challenging, but not impossible task. Especially the winning system is capable of finding 51 % of all annotated scene boundaries in the in-domain data, which is a promising score. The bigger issue of this system at the moment seems to be its precision (29 %), indicating that many of the boundaries the system predicts are wrong. We provide an analysis of what leads to these results in the next section. Interestingly, all systems except the one from Kurfali and Wirén (2021) actually performed better on the out-of-domain evaluation set of Track 2 than on the (in-domain) dime novels of Track 1. However, it must also be noted that the scores are overall somewhat low and the differences should therefore not be overinterpreted. We can also see that the ranking according to the γ measure would be rather similar to the F1-score ranking. However, there are also differences in the ranking; for example, the system submitted by Hatzel and Biemann (2021) would have been ranked higher in both tracks according to γ. This shows that the selection of a fitting evaluation measure for scene segmentation is indeed important.</p>
        <p>Schneider et al. (2021) present the
“Embedding Delta Signal” as a method for both
scene segmentation and topic segmentation. They
focus on context change in documents using a
sliding window method that compares cluster
assignments of word embeddings using the cosine
distance measure and detect scene changes by
searching for local maxima in the signal. In a further
step, they distinguish between different scene types
using a simple support vector machine approach
with hyper-parameter search. They use an
additional evaluation method, intersection over union
of predicted and actual scenes, arguing that this
measure is more suitable because it punishes
scene boundaries that are in the vicinity of the gold
annotations less severely than the F1-score.</p>
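        <p>The core of such an embedding-delta signal can be sketched as follows (a toy illustration with hand-made two-dimensional vectors; the submitted system works on cluster assignments of real word embeddings):</p>

```python
def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def delta_signal(sent_vecs, window=2):
    """Cosine distance between the mean vectors of the preceding and
    following window at every possible split point."""
    def mean(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return [
        1.0 - cosine(mean(sent_vecs[i - window:i]),
                     mean(sent_vecs[i:i + window]))
        for i in range(window, len(sent_vecs) - window + 1)
    ]

def local_maxima(signal, threshold=0.1):
    """Indices whose value exceeds both neighbours and a threshold;
    these are the scene change candidates."""
    return [
        i for i in range(1, len(signal) - 1)
        if signal[i] > threshold
        and signal[i] > signal[i - 1]
        and signal[i] > signal[i + 1]
    ]

# toy sentence "embeddings" with a clear topic shift after sentence 3
vecs = [[1, 0], [0.9, 0.1], [1, 0.1], [0, 1], [0.1, 1], [0, 0.9]]
sig = delta_signal(vecs, window=2)
print(local_maxima(sig))  # the single peak marks the shift
```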
      </sec>
      <sec id="sec-5-6">
        <title>Additional Evaluation</title>
        <p>Addressing the fact that our F1-score is a very
unforgiving measure, since only exact matches are
counted as correct scene boundaries, we performed
some additional evaluation on the predictions by
the different systems.</p>
        <p>As a first step, we noticed that some of the
systems had a tendency to predict multiple short
scenes in the vicinity of a hand-annotated scene
change. Therefore, we conducted an additional evaluation where we merged scenes that were less than
5 sentences long to the preceding or following
scene, if this led to a correctly predicted scene (e.g., if
the beginning of the short scene was a gold scene
boundary and the end of the next scene was a gold
scene boundary, the two scenes were merged). This
improved some of the scores by up to 3 percentage
points in F1-score. Note that this is not a “valid”
evaluation, since the decision whether to merge to
the preceding or the following scene is made based
on the gold standard. However, it does show that
correct handling of short scenes would have some
positive influence on the results.</p>
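        <p>The merging step can be sketched as follows (our own simplified version that only merges a short segment with its following neighbour; as noted above, it consults the gold standard and is therefore diagnostic only, not a fair evaluation):</p>

```python
def merge_short_segments(pred, gold, min_len=5):
    """pred, gold: lists of (start, end) sentence spans in text order.
    A predicted segment shorter than min_len sentences is merged with
    its following segment if the merged span matches a gold segment."""
    gold_spans = set(gold)
    merged = []
    i = 0
    while i < len(pred):
        start, end = pred[i]
        if (end - start < min_len and i + 1 < len(pred)
                and (start, pred[i + 1][1]) in gold_spans):
            merged.append((start, pred[i + 1][1]))  # absorb spurious border
            i += 2
        else:
            merged.append((start, end))
            i += 1
    return merged

gold = [(0, 10), (10, 20)]
pred = [(0, 3), (3, 10), (10, 20)]  # a spurious boundary after sentence 3
print(merge_short_segments(pred, gold))  # [(0, 10), (10, 20)]
```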
        <p>Additionally, we analysed whether we could
determine especially “important” scene boundaries
more reliably. To this end, the existing annotations
of Track 1 were re-edited: Annotators were asked
to identify strong and weak boundaries between
the previously annotated scenes, depending on how
they judged the importance of each boundary. A
strong boundary is one that must be set in any
annotation, while a weak boundary is one that may
be omitted based on the desired level of granularity.
Note that we did not collect any additional scene
annotations, but only categorised the existing ones
further. We did not see significant changes in the
performance when considering only strong
boundaries. In particular, the recall was not consistently
higher than for all scene boundaries.
</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Manual Error Analysis</title>
      <p>
        In this section, we provide a deeper analysis of the
prediction errors that the best-performing system
on Track 1
        <xref ref-type="bibr" rid="ref1 ref10 ref17 ref32">(Kurfali and Wirén, 2021)</xref>
        makes. To
this end, we manually analyse the predictions and
potential error sources on two texts from Track 1:
• Hochzeit wider Willen (Wedding against Will, the text with the best γ score)
• Bomben für Dortmund (Bombs for Dortmund, the text with the second-worst γ score; we decided not to use the text Die Begegnung, which has the worst γ score, since it was a very hard text even for the annotators)
      </p>
      <p>Table 2 compares the manual to the automatic
annotations for these texts. The analysis reveals that
the following factors have a particular influence on
the predictions: (a) the length of the detected scenes, or the granularity of scene detection in general, (b) explicit markers of time and space changes, (c) changes in the character constellation (entrance and exit of characters, especially protagonists), (d) the naming and description of newly introduced characters (full name plus verb sequence), and (e) the end of dialogue passages. We provide a brief overview of these problematic factors here and refer to Appendix C for a detailed analysis with specific examples.</p>
      <sec id="sec-6-1">
        <title>Analysis of Markers</title>
        <p>First, we investigate how the markers used in our
definition of scenes influence the system’s
decisions regarding scene borders.</p>
        <p>Time Markers The system clearly seems to have
identified time markers as an important signal for
scene changes. Many false positives (scene borders
annotated by the system, but not by the human
annotators) start with temporal markers, especially
the word “as”. Overall, the system appears to have
overgeneralised the impact of temporal markers,
seeing every mention of time in the text as a strong
signal for a scene change.</p>
        <p>Location Markers A similar issue arises with
the presence of location markers: the system is
very sensitive to changes in action space, often
producing false positives at the mention of locations.
According to our annotation guidelines, only
significant location changes induce a scene border while,
for example, moving through rooms in a house is
not necessarily cause enough for a scene change.
</p>
        <p>Changes in Character Constellation Another marker that our scene definition takes into account is the character constellation. We find that the model is capable of identifying the introduction of a new character, often accompanied by the character’s full name as well as a short description, as a marker for a new scene. However, once again the system seems to struggle with judging the importance of character constellation changes, showing a tendency to start a new scene for every introduction. Dialogue Passages Dialogue passages are not explicitly part of our scene definition; however, it is reasonable to assume that they can be valuable markers for scenes: for one, dialogues appear almost exclusively in scenes and rarely in non-scenes. Additionally, a new scene usually does not start in the middle of a dialogue passage. The model seems to have picked up on this fact, since it has a tendency to predict scene changes at the end of dialogue passages. While this can be a valid marker, it again leads to false positives in the system’s output.</p>
      </sec>
      <sec id="sec-6-2">
        <title>General Issues of the Model’s Output</title>
        <p>Here, we attempt to extract a generalisation of the
specific issues described before. They can be
grouped into two major categories: issues with scene
length and issues with the granularity of markers.
Scene Length One of the most general problems
was that the system predicts very short scenes in
succession, often caused by the occurrence of
multiple markers within a few sentences. In our manual
annotations, very short passages are usually not
considered as separate scenes, but rather as part of
the preceding or following scene. The system does
not appear to have learned this and therefore often
predicts multiple very short scenes in succession.
Granularity of Markers An issue that was noticeable for all of the markers discussed above is the system’s apparent inability to infer the importance of a scene change marker. Many false positive predictions are caused by small changes in time, place or character constellation that were not considered significant enough for a scene change by the annotators. In some cases, the model’s decision to predict a scene change is perfectly reasonable and can be seen as a more fine-grained scene segmentation than the one agreed on in our annotations (cf. Section 6). In other cases, however, the oversensitivity of the system is clearer, as for example with the temporal marker “as” (see above).</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Discussion</title>
      <p>In this section, we briefly discuss the results of the
shared task along with possible next steps towards
the improvement of automatic scene segmentation.</p>
      <p>
        The winning systems for both tracks
        <xref ref-type="bibr" rid="ref1 ref10 ref17 ref32 ref9">(Kurfali
and Wirén, 2021; Gombert, 2021)</xref>
        are based on
BERT variants, showing that, as for many other
NLP-tasks, pre-trained Transformer models are
very valuable for scene segmentation. However, the
results also reinforce our belief that scene
segmentation cannot be solved by BERT alone, but
requires a deeper understanding of the text. Some of
the submissions of the shared task explore
alternative ways of approaching scene segmentation,
either adapting methods from co-reference
resolution
        <xref ref-type="bibr" rid="ref1 ref10 ref17 ref32">(Hatzel and Biemann, 2021)</xref>
        , handcrafting
features that are assumed to be helpful for scene
segmentation
        <xref ref-type="bibr" rid="ref1 ref10 ref17 ref32">(Barth and Dönicke, 2021)</xref>
        , or using
differences in the text over time to derive scene
change candidates
        <xref ref-type="bibr" rid="ref32">(Schneider et al., 2021)</xref>
        .
      </p>
      <p>The two most consistent sources of errors in
the most successful model are the granularity of
scene change markers and the length of scenes.
Both of these problems should be – at least in part
– addressable by introducing additional constraints
or signals to the model. For the scene length, it
seems promising to make the model aware of the
length of the current scene, which could prevent it
from predicting many short scenes, or to use global
information about the scene boundaries. A possible
approach to this has been used by Pethe et al. (2020)
for the related task of chapter segmentation and
was also applied in this shared task by Hatzel and
Biemann (2021) with some success.</p>
      <p>
        For the problem of granularity, the model could
be given access to explicit information regarding
the scale of the markers. For example, information
from knowledge graphs about the scale of
temporal markers or location changes could be useful
(e.g., a minute is much less relevant than a month;
a different room is much less relevant than a
different country). Character changes appear to be more
challenging in this regard, since the model needs to
be able to judge the importance of a character for
the current scene. This might be achieved by
applying co-reference resolution to the texts and building
a local character network, representing how many
interactions each character has with others in the
neighbouring text, how often they are mentioned,
etc. Although a somewhat boring solution, using
more training data might also enable the model to
learn the granularity of markers, at least for
location and temporal markers. A possible step in this
direction is to use the related task of chapter
segmentation
        <xref ref-type="bibr" rid="ref26">(Pethe et al., 2020)</xref>
        , for which a large
amount of weakly labelled training data is available,
for pre-training and then fine-tuning the resulting
model for scene segmentation. While chapters and
scenes are different in principle (cf. Section 3), they
may be similar enough to make this pre-training
step promising. On the other hand, it might be
interesting to explore the scene segmentations provided
by the model further. Our annotations represent our
understanding of a scene, however other
applications may require a more fine- or coarse-grained
definition. To this end, it seems promising to
optimise a model for recall (i.e., detect as many
annotated scene borders as possible) in a first step and
then filter these candidates for the desired level of
granularity in a second step.
      </p>
      <p>One of the most surprising results of the shared
task is the fact that most models perform better
on the out-of-domain high-brow literature than the
in-domain dime novels. This is in stark contrast
to our previous intuition, for two reasons: First,
the training data consists of dime novels, which
should lead to a model that is better suited to this
type of text. Secondly, from a literary perspective,
we expected the high-brow literature to be more
challenging to understand and therefore the scene
segmentation to be more difficult. However, the
more implicit style of writing in high-brow literature
may actually be helpful for the models here. While
dime novels often present explicit references to
characters, locations or the passing of time, high-brow
literature may use these references much more
sparsely, making them more reliable markers of scene
changes. Although the number of data points is too
low to make a reliable statement, the higher
precision of predictions from Gombert (2021) on the
high-brow texts compared to the dime novels (cf.
Table 8) might point in a similar direction.</p>
      <p>Finally, we also see that the choice of evaluation
measure is important, as F1-score and γ lead to
different rankings in both tracks. For this shared
task, we have decided to use the exact F1-score
as the main measure; however, this decision is not
final. As discussed above, measures that
take into account the proximity of predicted to gold
standard scenes, like γ, are equally valid, albeit
more difficult to interpret. Schneider et al. (2021)
propose a third potentially useful measure,
intersection over union. While this measure would have
to be adapted to handle both scenes and
non-scenes, it is also a promising direction.</p>
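<p>To make the difference between the measures concrete, the exact F1-score and a simple segment-level intersection over union can be sketched as follows. This is our own minimal illustration with assumed input conventions (borders are sentence indices where a new segment starts, segments are half-open index spans), not the official evaluation script:</p>
<p>
```python
def exact_f1(gold_borders, pred_borders):
    """F1 where a predicted border only counts if it matches exactly."""
    gold, pred = set(gold_borders), set(pred_borders)
    if not gold or not pred:
        return 0.0
    tp = len(gold.intersection(pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def segment_iou(gold_segs, pred_segs):
    """Mean best-match IoU of predicted (start, end) spans against gold."""
    def iou(a, b):
        inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union else 0.0
    scores = [max(iou(p, g) for g in gold_segs) for p in pred_segs]
    return sum(scores) / len(scores) if scores else 0.0
```
</p>
<p>Unlike exact F1, the IoU variant rewards a border that misses the gold position by a single sentence almost as much as an exact hit, which illustrates why different measures can produce different rankings.</p>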
    </sec>
    <sec id="sec-8">
      <title>Conclusion</title>
      <p>In this paper, we have summarised the results of
the Shared Task on Scene Segmentation, where
the objective was to develop a method for
automatic scene segmentation in literary narrative texts.
To this end, we provided a training set of 20
dime novels and evaluated the submitted systems on
two tracks, one with in-domain data and one with
out-of-domain data in the form of high-brow
literature. Overall, our shared task has received five
submissions with very different approaches to
scene segmentation. While none of these systems
was capable of solving the task completely, the
best-performing systems for each track yielded
promising results, with F1-scores of 37 % on Track
1 and 26 % on Track 2, respectively. These results
show that scene segmentation remains challenging,
but also that it is not an impossible task. In manual
analysis, we discovered that the models are capable
of picking up many important markers for scene
boundaries, but sometimes still struggle to draw the
correct conclusions from these markers.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>We would like to thank all participants for their
submissions. We are especially happy about the
wide range of completely different and orthogonal
approaches, opening great possibilities for future
work on this challenging task!</p>
      <p>Dataset Information</p>
      <p>[Table: Title, Author, Number of Segments, Percentage of Scenes, Number of Sentences, Number of Tokens and Avg. Scene Length (Tokens) for the 20 dime novels of the training set: Bezaubernde neue Mutti; Widerstand zwecklos; Tausend Pferde; Der Turm der 1000 Schrecken; Die Widows Connection; Ein sündiges Erbe; Immer wenn der Sturm kommt; Wechselhaft wie der April; Lass Blumen sprechen; Prophet der Apokalypse; Verschmäht; Der Sohn des Kometen; Ein Weihnachtslied für Dr. Bergen; Die hochmütigen Fellmann-Kinder; Als der Meister starb; Hetzjagd durch die Zeit; Wir schaffen es - auch ohne Mann; Griseldis; Deus Ex Machina; Die Abrechnung; and for the test texts Im Bann der Vampire, Bad Earth, Hochzeit wider Willen and Bomben für Dortmund.]</p>
      <p>[Tables: Per-text evaluation results (precision, recall and F1 with micro, macro and weighted averages) for (a) Im Bann der Vampire, (b) Die Begegnung, (c) Hochzeit wider Willen, (d) Bomben für Dortmund, (e) Aus guter Familie and (f) Effi Briest.]</p>
    </sec>
    <sec id="sec-10">
      <title>Detailed Manual Analysis</title>
      <sec id="sec-10-1">
        <title>C.1 Time markers</title>
        <p>
          As a starting point for the error analysis, the
actual markers for scene changes, as known from the
guidelines
          <xref ref-type="bibr" rid="ref8">(Gius et al., 2021)</xref>
          , were considered
separately: changes of narrated time, place, action
and character constellation. In doing so, an
overgeneralisation of time markers was detected in the
output of the winning system. It is noticeable that
many annotated scenes start with formulations like
“as”, “it was five over”, “at this moment”, “three
minutes elapsed”, indicating changes in the time
of the narrative. A particularly large number of falsely
annotated scene changes (false positives) begin with the
temporal indicator “as”. The following passage shows
an example of a wrong scene change indication
triggered by the temporal conjunction “as”,
marking a change in the narrated time. According to
the gold standard, this passage does not include a
scene change.
        </p>
        <p>’If there really was something to the call,
the colleagues in the radio patrol car
might still be able to catch the man who
had buzzed me out of my sleep. I got up
and went to take a shower. A cold one
would have been best now. But I wasn’t
brave enough to do that yet. It was five
over. Fat Peter Steiner, the owner of the
bar Steinkrug, had by all appearances put
not only rat poison but also a strong
sleeping pill in the grain. As I got dressed
and was about to leave the apartment, the
phone rang again. ’Mattek?’ ’Speaking.’
’Did you get the message through to the
alarm center?’ ’Yes.” (German original
text in Figure 1)</p>
        <p>In this example, the short reflective passage
containing the first-person narrator’s thoughts about
the night before interrupts the narrated action,
which is resumed with the words “As I got dressed
and was about to leave [...]”. The temporal
conjunction “as” could have caused the system to indicate
a scene change, whereas according to the gold
standard there is no scene change. This indication of a
scene change may have resulted from
overgeneralisation of the system. The use of temporal markers
as indicators of probable scene changes is often
successful, but risks an over-sensitive system.</p>
        <p>However, not only temporal conjunctions seem
to trigger the system to indicate scene changes, but
also multi-word expressions containing information
on the narrated time, as can be seen in the
following example, again from Bomben für Dortmund,
in which a new scene was indicated at a position
differing from the gold standard annotation. This
example raises the question of granularity, which will
be taken up in subsection C.6.
        <p>’There was no need to hurry. Regarding
the station, we had everything under
control. Sure, there were loopholes to
escape, but someone who didn’t even suspect
being expected had no reason to look for
them and use them. I simply assumed
that Jutta Speißer didn’t have the
faintest idea that we knew practically
everything about her. Two, maybe three
minutes passed. ’Can you hear me,
Hermann?’ I had hidden the walkie-talkie
under my leather jacket so that I could
talk into it if I lowered my head a little.
’Yes.” (German original text in Figure 2)</p>
        <p>Nevertheless, there are also many passages
containing temporal markers that the system correctly
indicated as new scenes. Another example from
Bomben für Dortmund shows how it detected the
scene change without requiring the temporal
marker to be at the beginning of the sentence.
’Maybe,’ I said, ’[...]. One devilish lady,
one big bastard, and the third bomb we
know about. We’re going back to the
station. Lampert and Blechmann will report
there.’ The feeling of being watched
faded when she left the train in Brackel
and the long, skinny man who had caught
her attention on the train was no longer
behind her. But she had quickly calmed
down.’ (German original text in Figure 3)</p>
<p>This correct detection of the scene
change could, however, also be related to the
simultaneous change of the action space, which
will be discussed in more detail in the next
section.</p>
      </sec>
      <sec id="sec-10-2">
        <title>C.2 Change of action space</title>
        <p>Another possible overgeneralisation of the system
could be its hypersensitivity to descriptions of the
action space, since many scene changes
annotated by the system happen to be accompanied by
references to changes in the action space at the
beginning of a new scene.</p>
        <p>The following passage is an example from
Bomben für Dortmund of a correctly annotated scene
change followed by an indication of a change in
the action space.</p>
        <p>’I nodded to DAB. ’Give me your
walkie-talkie, DAB. Get one from another
officer.’ He didn’t expect anything from
it, it was clear from his face, but he
gave me his walkie-talkie. I disappeared
from track one and walked through
the underpass to the stairs leading up
to track three. There was no sign of Tin
Man. Nor was there any sign of the
person he had described. That meant they
must already be upstairs. I stopped in the
middle of the stairs, lit a fresh
cigarette and waited.’ (German original text in
Figure 4)</p>
<p>In addition to the many true positive scene
changes that the system recognises, as in the previous
sample passage, there are also many false positives
that can be interpreted as the result of the system’s
overgeneralisation. One example can be found in
the following sample passage, in which the main
characters do not move but an action outside of the
scene setting is described that probably triggered
the annotation of a wrong scene change within that
scene. There is no scene change according to the
gold standard.</p>
        <p>”You’d make a great cop chick,’ I said,
’I used to be in the Scouts.’ Outside,
in the small reception hall, someone
pounded on the bell as I did. ’I don’t have
time now, have to take care of the guests
and sell rooms, or I’ll be out of a job. At
nine?’ ’You bet!” (German original text
in Figure 5)</p>
        <p>
Since the system generally tends towards
fine-grained scene segmentation, it is not surprising that
it often annotates too many scene changes in
addition to some actual scene changes. The following
passage shows an example of fine-grained scene
annotation by the system. In the passage, the main
characters move from the hotel reception to the
kitchen in the next room. For the manual annotation
process, this change of action space is a
prototypical example of the application of the container
principle defined in the annotation guidelines
          <xref ref-type="bibr" rid="ref8">(Gius
et al., 2021, 4)</xref>
          . This principle is used to summarise
short scenes without clear scene change indicators,
e.g., when the characters remain the same and the
change from one action space to another is
described, while the settings are close to each other and
often in the same building, as is the case in this
sample passage. Nonetheless, this scene change
could be reasonable if the goal were a more
fine-grained scene annotation. These considerations
have inspired us to look more closely at the
distinction between weak and strong boundaries, which
we analyse in subsection 4.5.
        </p>
        <p>”Free choice. You won first prize with
me.’ ’What would the second have
been?’ ’A washing machine.’ ’I’d rather
have the first, to be honest. I finish at
eleven.’ ’Then the choice of fine venues is
very limited.’ ’Pull strings,’ she said. I
followed her into the small, white-tiled
kitchen, where breakfast was also made
for the guests. The sight of her made me
look forward to the evening.’ (German
original text in Figure 7)</p>
        <p>Another recurring phenomenon that often
triggers a change of scene is a character entering or
exiting a scene. As in the following example from
Bomben für Dortmund, which contains a collection of
typical verbal phrases, the exiting and re-entering
of a character is introduced by indications of a
character’s movement from one location to another
via the phrases ’to leave’, ’to go back to’, ’to turn
into’, and ’to disappear into’. However, according to
the gold standard, the scene change should be
placed before the beginning of the sentence ’It was
still raining cats and dogs’ to mark the beginning of
the new scene outside the restaurant. Probably due
to oversensitivity, the winning system annotated
two scene changes instead of only one as in the
gold standard, also missing the actual position of
the scene change.</p>
        <p>’She flus hed the toilet as a cover, left the
cabin and washed her hands. Then, in
front of the large, clean mirror, she fixed
her frayed hair, which had nothing to be
fixed. Then she went back to the
restaurant. She drank the rest of the ouzo left
in the glass, smiled at Dimitri, the owner,
secretly wished him all known and
unknown venereal diseases, preferably all
at once, left an appropriate tip and left
the restaurant. It was still raining cats and
dogs. In the reflection of some lanterns,
the rain looked like many cords next to
each other, which did not tear and did
not come to an end. It was just before
seven when she turned into Karl Marx
Street, crossed it and disappeared into
Rubel Street. At the end of the street stood the
green Sierra.’ (German original text in
Figure 6)</p>
        <p>As has become clear in this subsection,
characters and their entrances and exits play a significant
role in automatic annotation as markers of a likely
scene change. In the following subsection, we will
discuss another phenomenon related to characters
that often coincides with scene change annotations,
namely changes in character constellation.</p>
      </sec>
      <sec id="sec-10-3">
        <title>C.3 Change in Character Constellation</title>
        <p>Another marker that frequently occurs at the
beginning of automatically detected scenes is the
introduction of a new character with the respective full
name as well as the accompanying description of
the character, its state or an action (presented as
a combination of full name plus verb sequence).
It can be concluded that the system has learned
that this combination occurs frequently at scene
beginnings. However, the following two examples
(German original texts in Figures 8 and 9)
show that this is not always the case.
<p>The first passage, from Bomben für Dortmund,
is an example of the winning system correctly
detecting a new scene that begins with the
introduction of a new character.</p>
        <p>’At nine I had an appointment with
Marlies. Lohmeyer couldn’t ruin it for me.
Jutta Speißer ate stifado and drank
Cypriot Aphrodite wine. For Dimitri, the
owner of the Greek restaurant
’Akropolis’ on Karl-Zahn-Straße, she was a new,
welcome guest.’ (German original text in
Figure 8)</p>
<p>The second passage is an example of a scene
annotation differing from the gold standard; it
contains the introduction and description of three new
characters.</p>
        <p>’Baldwein started the green Sierra. He
slowly steered the vehicle past the post
office and drove in the direction of Hoher
Wall. Although Police Sergeant Werner
Okker had not been drinking last night
because of the duties he now had to fulfil
towards his fiancée as her officially betrothed,
he looked bad. He
was sitting at the counter of the
Steinkrug. His angular, broad shoulders
slumped forward in a tired manner. He
seemed to be visibly struggling to lift his
beer glass. Susanne Steiner stood behind
the bar. Large, coarse-boned, Nordic. A
girl who had grown up in the pub
milieu. She had long, brunette hair and a
decidedly beautiful face with full,
sensual lips. Peter Steiner, her father, who
was standing next to her at the tap, was
not at all like her. He was around sixty.
A former tusker.’ (German original text
in Figure 9).</p>
        <p>According to the gold standard, there is only one
scene change in the text before ’Although Police
Sergeant Werner Okker [...]’, which was also
detected by the automatic system. In addition to this,
however, another scene change was indicated at
the introduction of the new character Peter Steiner.
It is noticeable that the constructions around the
introduction of the character Susanne Steiner and
the character Peter Steiner are similar in structure,
but the sentence introducing Susanne Steiner was
not recognised as the beginning of a new scene.</p>
        <p>Another example of an incorrectly marked new
scene which coincides with the introduction of a
new character can be found in Hochzeit wider
Willen. According to the gold standard, there is no
scene change in the following passage.</p>
        <p>’It was a warm morning at the beginning
of August, the sun was shining golden
in the breakfast room of the town palace.
Here the Hohenstein family had gathered
for the first meal of the day. Fürst
Heinrich, head of the family and chairman of
the Hohenstein Bank, a traditional house
in the Frankfurt financial center, was
talking lively with his elder son Bernhard.’
(German original text in Figure 10)</p>
<p>Since similar constructions occur in the text
Hochzeit wider Willen and can be found at the
beginning of scenes detected by the system (as in
Figure 10), this is not a singular phenomenon
that occurs specifically in
the text Bomben für Dortmund.
It is also noticeable that the end of an automatically
detected scene is often accompanied by the end of
a dialogue passage, which is then followed by a
descriptive passage that represents the beginning
of a new scene.</p>
        <p>The following example from Hochzeit wider
Willen shows a passage that the winning system
segmented into four different scenes, indicating a
scene change after every ending of a dialogue passage
followed by a descriptive passage without any
dialogues. However, according to the gold standard,
there is only one scene change in the passage
before ’Prince Frederik appeared in his office a little
later than usual that morning’.
        <p>’Frederik gazed pensively into his coffee.
’Well, someday I’ll get myself a lovely
wife and a few offspring, but I still
have a bit of a reprieve. Let’s say ten to
fifteen years . . . ’ ’You’ve got a lot of
nerve.’ The princess laughed and stood
up. ’You don’t really believe that.’ ’Oh
yes I do,’ he murmured and smiled
narrowly. ’I know that.’ Prince Frederik
appeared in his office a little later than
usual that morning. Carina Böttiger, his
secretary, was used to this and also knew
what state her boss was in on such days.
The petite blonde with the sky-blue eyes
had strong coffee and aspirin ready. She
brought both together with the signature
folder. [...]. ’You look lovely today,’
Frederik noted, glancing at her dress. He
eyed her rather thoughtfully for a
moment, and she pretended not to notice,
just thanking him artfully for the
compliment and asking if there was anything
else she could do for him. ’No, that was
all for the moment.’ He gave her back
the signature folder. ’When Herr von
Solm comes, send him right through.
I have something else to discuss with
him.’ He noticed her slightly smug
look, so he clarified: ’Something
business-related.’ Carina laughed slightly
and left the executive room. The fact
that Frederik had noticed her new dress
made her happy. Until now, she had
always believed that he hardly had an eye
for her. But she didn’t want to get any
ideas about that either. After all, it
seemed clear that this man was out of her
reach. And she was really too good for
a brief fling with the ladies’ man.’
(German original text in Figure 12).</p>
        <p>One possible interpretation of this regular
annotation of a scene change as a separation of dialogue
and descriptive passages could be that the system
recognises these passages as different writing
styles, leaving the actual reasons for scene changes
unconsidered.</p>
<p>C.5
However, the most common error, which was also
the easiest to spot, was the output of scenes
that are only one to three sentences long, as in the
following example from Hochzeit wider Willen:
’One could see from the mother’s face
that this was not necessarily the case. But
Hedwig sensed that she would not
receive any more information from
Carina. ’My little princess ...’ That was what
she had called Carina as a child. None
of them could have imagined that she
would ever become a real princess. And
if the young woman was honest, she still
couldn’t quite believe it now. A little
later, the bride and groom left for the
airport. Ewald Böttiger asked his
wife: ’What did you have to talk about for
so long? Everyone was waiting for you’.
’I’m not sure if Carina married the right
guy[...].’ (German original text in
Figure 11)</p>
        <p>In the manual scene annotation following the
guidelines by Gius et al. (2021), the decision was
made to append very short scenic passages to the
appropriate preceding or following scene in the
sense of the container principle. In the given example,
however, there is no scene change at all, because it
is only a description of the exit of the characters,
which takes place within the scene at the bride’s
parents’ house.</p>
</sec>
      <sec id="sec-10-5">
        <title>C.6 Granularity of Scenes</title>
        <p>With respect to the length of the individual
passages that should be detected as scenes, there is also
the question of how granular the segmentation into
scenes should be without becoming too small-scale.
The following passage from Hochzeit wider Willen
is an example of a small-scale, granular scene
segmentation choice by the winning system, in which
three scenes were indicated while there are only
two according to the gold standard.</p>
        <p>’Prince Frederik was quite pleased with
himself. Carina had swallowed his
excuse whole. She had thus given him a free
pass, so to speak, to finally go back to
living the way he liked. And he was
determined to do so immediately ... The very
next evening, Frederik called his wife to
let her know that it was getting late. He
was supposedly waiting for the
conclusion of a lucrative business deal. Carina
did not suspect anything - yet. When she
asked him the next morning when he had
come home, he did not tell her the truth.’
(German original text in Figure 13)
’Prince Frederik was quite pleased with
himself. Carina had swallowed his
excuse whole. She had thus given him a free
pass, so to speak, to finally go back to
living the way he liked. And he was
determined to do so immediately ... The very
next evening, Frederik called his wife to
let her know that it was getting late. He
was supposedly waiting for the
conclusion of a lucrative business deal. Carina
did not suspect anything - yet. When she
asked him the next morning when he had
come home, he did not tell her the truth.’
(German original text in Figure 14)</p>
<p>In this text passage, the automatic system’s
choice to recognise another scene is not
implausible. On the contrary, the system’s decision
can be justified, but such small-scale granularity of
scene annotation should be avoided in view of the
overall goal of the segmentation task, in which a
text is to be segmented into units of meaning in
terms of content, which should exceed a minimum
token length for their further use. Here, the system
was more fine-grained than the gold standard.</p>
<p>C.7 German original text of the sample
passages</p>
<p>
’Falls an dem Anruf wirklich etwas dran war,
konnten die Kollegen im Funkstreifenwagen den Mann
vielleicht noch stellen, der mich aus dem Schlaf
gebimmelt hatte. Ich stand auf und ging unter die
Dusche. Eine kalte wa¨re jetzt am besten gewesen.
Aber dazu war ich noch nicht mutig genug. Es war
fu¨nf vorbei. Der fette Peter Steiner, der Wirt vom
Steinkrug, hatte allem Anschein nach nicht nur
Rattengift, sondern auch ein starkes Schlafmittel in
den Korn gepanscht. Als ich mich angezogen hatte
und die Wohnung verlassen wollte, la¨utete das
Telefon erneut. ’Mattek?’ ’Am Apparat’. ’Haben Sie
die Meldung an die Alarmzentrale durchgegeben?’
’Ja.”
Figure 1: Example from Bomben fu¨r Dortmund of a
wrong scene change indication triggered by the
temporal conjunction ’als’ marking a change in the narrated
time.
’Es bestand kein Grund zur Eile. Was den
Bahnhof anging, so hatten wir alles unter Kontrolle.
Sicher gab es Schlupfl o¨cher zum Entkommen, aber
jemand, der nicht einmal ahnte, dass er erwartet
wurde, hatte auch keinen Grund, danach zu suchen
und sie zu benutzen. Ich ging einfach davon aus,
dass Jutta Speißer nicht den blassesten Schimmer
davon hatte, dass wir praktisch alles u¨ber sie
wussten. Zwei, vielleicht drei Minuten verstrichen.
’Kannst du mich ho¨ren, Hermann?’ Ich hatte das
Walkie-talkie so unter der Lederjacke verborgen,
dass ich hineinsprechen konnte, wenn ich den Kopf
etwas senkte. ’Ja.”
”Viel leicht’, sagte ich. ’[...]. Eine teuflische Lady,
einen großen Schweinehund und die dritte Bombe,
von der wir wissen. Wir fahren ins Revier zuru¨ck.
Lampert und Blechmann werden sich dort
melden.’ Das Gefu¨hl, beobachtet zu werden, schwand,
als sie in Brackel den Zug verließ und der lange,
du¨rre Mann nicht mehr hinter ihr war, auf den sie
im Zug aufmerksam geworden war. Aber sie hatte
sich schnell wieder beruhigt.’
’Ich nickte DAB zu. ’Gib mir dein Walkie-talkie,
DAB. Hol dir eins von einem anderen Beamten.’ Er
versprach sich nichts davon, das war ihm deutlich
anzusehen, aber er gab mir sein Walkie-talkie. Ich
verschwand von Gleis eins und lief durch die
Unterführung bis zur Treppe, die nach Gleis
drei hinaufführte. Von Blechmann war nichts zu
sehen. Von der Person, die er beschrieben hatte,
ebenfalls nicht. Das hieß, sie mussten schon oben
sein. Ich blieb mitten auf der Treppe stehen,
zu¨ndete mir frische Zigarette an und wartete.
”Du wa¨rst eine prima Polizistenbraut’, sagte ich.
’Ich war mal bei den Pfadfindern.’ Draußen, in
der kleinen Empfangshalle, ha¨mmerte jemand
auf die Glocke, wie ich es getan hatte. ’Ich
habe jetzt keine Zeit mehr, muss mich um die Ga¨ste
und Zimmer verkaufen, sonst bin ich meinen Job
los. Um neun?’ ’Worauf du dich verlassen kannst!”
’Sie spu¨lte zur Tarnung, verließ die Kabine und
wusch sich die Ha¨nde. Anschließend richtete sie
sich vor dem großen, sauberen Spiegel die
ausgefransten Haare, an denen es nichts zu richten gab.
Dann ging sie ins Restaurant. Sie trank den Rest
Ouzo, der sich noch im Glas befand, la¨chelte
Dimitri, den Besitzer, an, wu¨nschte ihm insgeheim
alle bekannten und unbekannten
Geschlechtskrankheiten, am liebsten auf einmal, ein angemessenes
Trinkgeld liegen und verließ das Restaurant. Es
regnete noch immer in Stro¨men. Im Widerschein
einiger Laternen sah der Regen aus wie viele sich
nebeneinanderbefindliche Bindfa¨den, die nicht
rissen und kein Ende nahmen. Es war kurz vor sieben,
als sie in die Karl-Marx-Straße einbog, sie
kreuzte und in der Rubelstraße verschwand. Ausgangs
stand der Gru¨ne Sierra.’
’Freie Auswahl. Du hast mit mir den ersten Preis
gewonnen.’ ’Was wa¨re der zweite gewesen?’
’Eine Waschmaschine.’ ’Der erste ist mir, ehrlich
gesagt, lieber. Ich mache um elf Schluss.’ ’Dann ist
die Auswahl der feinen Lokalita¨ten sehr begrenzt.’
’Lass deine Beziehungen spielen’, sagte sie. Ich
folgte ihr in die kleine, weiß gekachelte Ku¨che, in
der auch das Fru¨hstu¨ck die Ga¨ste gemacht wurde.
Ihr Anblick ließ mich auf den Abend hoffen.’
’Um neun hatte ich eine Verabredung mit
Marlies. Die konnte Lohmeyer mir nicht kaputtmachen.
Jutta Speißer aß Stifado und trank zypriotischen
Aphrodite-Wein. Fu¨r Dimitri, den Besitzer des
griechischen Restaurants ’Akropolis’ in der
Karl-ZahnStraße, war sie ein neuer, willkommener Gast.’
’Baldwein startete den gru¨nen Sierra. Langsam
lenkte er das Fahrzeug am Postgiroamt vorbei und
fuhr in Richtung Hoher Wall. Obgleich
Polizeimeister Werner Okker gestern Nacht nicht
getrunken hatte, wegen der Pflich ten, die er als nun
offiziell Verlobter seiner Verlobten gegenu¨ber zu erfu¨llen
hatte, sah er schlecht aus. Er saß am Tresen vom
Steinkrug. Die eckigen, breiten Schultern waren
mu¨de nach vorn abgefallen. Es schien ihm sichtlich
Mu¨he zu bereiten, sein Bierglas zu heben.
Susanne Steiner stand hinter der Theke. Groß,
grobknochig, nordisch. Ein Ma¨dchen, das im
Kneipenmilieu groß geworden war. Sie hatte langes, bru¨nettes
Haar und ein ausgesprochen scho¨nes Gesicht mit
vollen, sinnlichen Lippen. Peter Steiner, ihr Vater,
der neben ihr am Zapfhahn stand, war ihr u¨berhaupt
nicht a¨hnlich. Er war um die Sechzig herum. Ein
ehemaliger Hauer.’
’Es war ein warmer Morgen Anfang August, die
Sonne schien golden in das Fru¨hstu¨ckszimmer des
Stadtpalais. Hier hatte sich die Fu¨rstenfamilie
Hohenstein zur ersten gemeinsamen Mahlzeit des
Tages versammelt. Fu¨rst Heinrich,
Familienoberhaupt und Vorstand der Hohenstein-Bank, eines
traditionsreichen Hauses am Frankfurter
Finanzplatz, unterhielt sich angeregt mit seinem a¨lteren
Sohn Bernhard.’
’Man sah der Mutter an, dass dies nicht
unbedingt der Fall war. Doch Hedwig spu¨rte, sie wu¨rde
von Carina keine weiteren Ausku¨nfte erhalten. [...]
’Meine kleine Prinzessin So hatte sie Carina als
Kind genannt. Keiner von ihnen ha¨tte sich wohl
vorstellen ko¨nnen, dass sie jemals eine wirkliche
Prinzessin werden wu¨rde. Und wenn die junge Frau
ehrlich war, konnte sie es jetzt noch immer nicht
so ganz fassen. Wenig spa¨ter fuhr das Brautpaar
zum Flughafen. Ewald Bo¨ttiger fragte seine Frau:
’Was hattet ihr denn noch so lange zu bereden?
Alle haben auf euch gewartet.’ ’Ich bin mir nicht
sicher, ob Carina den Richtigen geheiratet hat.”
’Frederik blickte sinnend in seinen Kaffee. ’Na ja,
irgendwann werde ich mir eben ein liebes
Frauchen und ein paar Spro¨sslinge zulegen, aber ein
bisschen Galgenfrist bleibt mir ja noch. Sagen
wir mal zehn bis fu¨nfzehn Jahre ’Du hast
Nerven.’ Die Prinzessin musste lachen und erhob sich.
’Das glaubst du doch wohl nicht im Ernst.’ ’Oh
doch’, murmelte er und la¨chelte schmal. ’Das weiß
ich.’ Prinz Frederik erschien an diesem Morgen
etwas spa¨ter als sonst in seinem Bu¨ro. Carina
Bo¨ttiger, seine Sekreta¨rin, war das gewohnt und
wusste auch, in welchem Zustand ihr Chef an solchen
Tagen war. Die zierliche Blondine mit den
himmelblauen Augen hielt starken Kaffee und Aspirin
bereit. Beides brachte sie zusammen mit der
Unterschriftenmappe. [...] ’Sie sehen heute hu¨bsch
aus’ , stellte Frederik mit einem Blick auf ihr Kleid
fest. Er musterte sie einen Moment lang ziemlich
nachdenklich, und sie tat so, als merke sie es gar
nicht, bedankte sich nur artig fu¨r das Kompliment
und fragte, ob sie sonst noch etwas fu¨r ihn tun
ko¨nne. ’Nein, das war im Moment alles.’ Er gab
ihr die Unterschriftenmappe zuru¨ck. ’Wenn Herr
von Solm kommt, schicken Sie ihn gleich durch.
Ich habe noch etwas mit ihm zu besprechen.’ Er
bemerkte ihren leicht süffisanten Blick und
stellte deshalb klar: ’Etwas Gescha¨ftliches.’ Carina
la¨chelte leicht und verließ das Chefzimmer. Dass
Frederik ihr neues Kleid bemerkt hatte, machte sie
glu¨cklich. Bislang hatte sie immer geglaubt, dass
er kaum einen Blick fu¨r sie hatte. Doch sie
wollte sich darauf auch nichts einbilden. Schließ lich
schien es klar, dass dieser Mann außer halb ihrer
Reichweite war. Und fu¨r eine kurze Affa¨re mit dem
Frauenliebling war sie sich wirklich zu schade.’
’Prinz Frederik war ganz zufrieden mit sich selbst.
Carina hatte seine Ausrede glatt geschluckt. Damit
hatte sie ihm sozusagen selbst den Freifahrtschein
ausgestellt, um endlich wieder so zu leben, wie es
ihm gefiel.Und er war fest entschlossen, dies auch
umgehend zu tun Bereits am na¨chsten Abend
meldete Frederik sich telefonisch bei seiner Frau
und ließ sie wissen, dass es spa¨t wurde.
Angeblich wartete er auf den Abschluss eines lukrativen
Gescha¨fts. Carina scho¨pfte noch keinen Verdacht.</p>
      </sec>
      <sec id="sec-10-8">
        <title>Als sie ihn am na¨chsten Morgen fragte, wann er</title>
        <p>heimgekommen sei, sagte er ihr nicht die
Wahrheit.’
’Prinz Frederik war ganz zufrieden mit sich selbst.
Carina hatte seine Ausrede glatt geschluckt. Damit
hatte sie ihm sozusagen selbst den Freifahrtschein
ausgestellt, um endlich wieder so zu leben, wie es
ihm gefiel.Und er war fest entschlossen, dies auch
umgehend zu tun . . . Bereits am na¨chsten Abend
meldete Frederik sich telefonisch bei seiner Frau
und ließ sie wissen, dass es spa¨t wurde.
Angeblich wartete er auf den Abschluss eines lukrativen
Gescha¨fts. Carina scho¨pfte noch keinen Verdacht.</p>
      </sec>
      <sec id="sec-10-9">
        <title>Als sie ihn am na¨chsten Morgen fragte, wann er</title>
        <p>heimgekommen sei, sagte er ihr nicht die
Wahrheit.’</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Florian</given-names>
            <surname>Barth</surname>
          </string-name>
          and Tillmann Dönicke.
          <year>2021</year>
          .
          <article-title>Participation in the konvens 2021 shared task on scene segmentation using temporal, spatial and entity feature vectors</article-title>
          .
          <source>In Shared Task on Scene Segmentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Doug</given-names>
            <surname>Beeferman</surname>
          </string-name>
          , Adam Berger, and
          <string-name>
            <given-names>John</given-names>
            <surname>Lafferty</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Statistical models for text segmentation</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>34</volume>
          (
          <issue>1-3</issue>
          ):
          <fpage>177</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>David M.</given-names>
            <surname>Blei</surname>
          </string-name>
          , Andrew Y. Ng, and
          <string-name>
            <given-names>Michael I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Latent Dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          (Jan):
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Lynn</given-names>
            <surname>Carlson</surname>
          </string-name>
          , Daniel Marcu, and Mary Ellen Okurowski.
          <year>2002</year>
          .
          <article-title>RST Discourse Treebank, LDC2002T07</article-title>
          .
          <source>Technical report</source>
          , Philadelphia: Linguistic Data Consortium.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Freddy Y. Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Advances in domain independent linear text segmentation</article-title>
          .
          <source>arXiv preprint cs/0003083.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Arman</given-names>
            <surname>Cohan</surname>
          </string-name>
          , Iz Beltagy, Daniel King, Bhavana Dalvi, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Weld</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Pretrained language models for sequential sentence classification</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , pages
          <fpage>3693</fpage>
          -
          <lpage>3699</lpage>
          , Hong Kong, China. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Evelyn</given-names>
            <surname>Gius</surname>
          </string-name>
          , Fotis Jannidis, Markus Krug, Albin Zehe, Andreas Hotho, Frank Puppe, Jonathan Krebs, Nils Reiter, Nathalie Wiedmer, and
          <string-name>
            <given-names>Leonard</given-names>
            <surname>Konle</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Detection of scenes in fiction</article-title>
          .
          <source>In Proceedings of Digital Humanities 2019</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Evelyn</given-names>
            <surname>Gius</surname>
          </string-name>
          , Carla Sökefeld, Lea Dümpelmann, Lucas Kaufmann, Annekea Schreiber, Svenja Guhr, Nathalie Wiedmer, and
          <string-name>
            <given-names>Fotis</given-names>
            <surname>Jannidis</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Guidelines for detection of scenes.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Gombert</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Twin BERT contextualized sentence embedding space learning and gradient-boosted decision tree ensembles for scene segmentation in German literature</article-title>
          .
          <source>In Shared Task on Scene Segmentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Hans Ole</given-names>
            <surname>Hatzel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Applying coreference to literary scene segmentation</article-title>
          .
          <source>In Shared Task on Scene Segmentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Marti A</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>TextTiling: Segmenting text into multi-paragraph subtopic passages</article-title>
          .
          <source>Computational linguistics</source>
          ,
          <volume>23</volume>
          (
          <issue>1</issue>
          ):
          <fpage>33</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Adebayo Kolawole</given-names>
            <surname>John</surname>
          </string-name>
          , Luigi Di Caro, and
          <string-name>
            <given-names>Guido</given-names>
            <surname>Boella</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Text segmentation with topic modeling and entity coherence</article-title>
          .
          <source>In International Conference on Hybrid Intelligent Systems</source>
          , pages
          <fpage>175</fpage>
          -
          <lpage>185</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Kauchak</surname>
          </string-name>
          and
          <string-name>
            <given-names>Francine</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Feature-based segmentation of narrative documents</article-title>
          .
          <source>In Proceedings of the ACL Workshop on Feature Engineering for Machine Learning in Natural Language Processing</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>39</lpage>
          , Ann Arbor, Michigan. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Anna</given-names>
            <surname>Kazantseva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Szpakowicz</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Hierarchical topical segmentation with affinity propagation</article-title>
          .
          <source>In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>47</lpage>
          , Dublin, Ireland. Dublin City University and Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Hideki</given-names>
            <surname>Kozima</surname>
          </string-name>
          and
          <string-name>
            <given-names>Teiji</given-names>
            <surname>Furugori</surname>
          </string-name>
          .
          <year>1993</year>
          .
          <article-title>Similarity between words computed by spreading activation on an English dictionary</article-title>
          .
          <source>In Proceedings of the European Association for Computational Linguistics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Hideki</given-names>
            <surname>Kozima</surname>
          </string-name>
          and
          <string-name>
            <given-names>Teiji</given-names>
            <surname>Furugori</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Segmenting narrative text into coherent scenes</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>13</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Murathan</given-names>
            <surname>Kurfali</surname>
          </string-name>
          and Mats Wirén.
          <year>2021</year>
          .
          <article-title>Breaking the narrative: Scene segmentation through sequential sentence classification</article-title>
          .
          <source>In Shared Task on Scene Segmentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Luheng</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Luke</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Higher-order coreference resolution with coarse-to-fine inference</article-title>
          .
          <source>In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>2</volume>
          (
          <issue>Short Papers)</issue>
          , pages
          <fpage>687</fpage>
          -
          <lpage>692</lpage>
          , New Orleans, Louisiana. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eduard</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A model of coherence based on distributed sentence representation</article-title>
          .
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>2039</fpage>
          -
          <lpage>2048</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Jiwei</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural net models for open-domain discourse coherence</article-title>
          .
          <source>arXiv preprint arXiv:1606.01545</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Barbara J.</given-names>
            <surname>Grosz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Candace L.</given-names>
            <surname>Sidner</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Attention, intentions, and the structure of discourse</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>175</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Michal</given-names>
            <surname>Lukasik</surname>
          </string-name>
          , Boris Dadachev, Gonçalo Simões, and
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Text segmentation by cross segment attention</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Yann</given-names>
            <surname>Mathet</surname>
          </string-name>
          , Antoine Widlöcher, and
          <string-name>
            <given-names>Jean-Philippe</given-names>
            <surname>Métivier</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>The unified and holistic method gamma (γ) for inter-annotator agreement measure and alignment</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ):
          <fpage>437</fpage>
          -
          <lpage>479</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Hemant</given-names>
            <surname>Misra</surname>
          </string-name>
          , François Yvon, Olivier Cappé, and Joemon Jose.
          <year>2011</year>
          .
          <article-title>Text segmentation: A topic modeling perspective</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>47</volume>
          (
          <issue>4</issue>
          ):
          <fpage>528</fpage>
          -
          <lpage>544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Yihe</given-names>
            <surname>Pang</surname>
          </string-name>
          , Jie Liu,
          <string-name>
            <given-names>Jianshe</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Kai</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Paragraph coherence detection model based on recurrent neural networks</article-title>
          .
          <source>In International Conference on Swarm Intelligence</source>
          , pages
          <fpage>122</fpage>
          -
          <lpage>131</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Charuta</given-names>
            <surname>Pethe</surname>
          </string-name>
          , Allen Kim, and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Skiena</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Chapter Captor: Text Segmentation in Novels</article-title>
          .
          <source>In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , pages
          <fpage>8373</fpage>
          -
          <lpage>8383</lpage>
          , Online. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Lev</given-names>
            <surname>Pevzner</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marti A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>A critique and improvement of an evaluation metric for text segmentation</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>28</volume>
          (
          <issue>1</issue>
          ):
          <fpage>19</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Karl</given-names>
            <surname>Pichotta</surname>
          </string-name>
          and Raymond J. Mooney.
          <year>2016</year>
          .
          <article-title>Learning statistical scripts with LSTM recurrent neural networks</article-title>
          .
          <source>In Thirtieth AAAI Conference on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Rashmi</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alan</given-names>
            <surname>Lee</surname>
          </string-name>
          , Nikhil Dinesh, Eleni Miltsakaki, Geraud Campion, Aravind Joshi, and
          <string-name>
            <given-names>Bonnie</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <source>Penn Discourse Treebank Version 2.0 LDC2008T05</source>
          . Web download,
          <source>Linguistic Data Consortium</source>
          , Philadelphia.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reiter</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Towards Annotating Narrative Segments</article-title>
          .
          <source>In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage</source>
          ,
          <source>Social Sciences, and Humanities (LaTeCH)</source>
          , pages
          <fpage>34</fpage>
          -
          <lpage>38</lpage>
          , Beijing, China. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Riedl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Chris</given-names>
            <surname>Biemann</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>TopicTiling: a text segmentation algorithm based on LDA</article-title>
          .
          <source>In Proceedings of ACL 2012 Student Research Workshop</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>42</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <given-names>Felix</given-names>
            <surname>Schneider</surname>
          </string-name>
          , Björn Barz, and
          <string-name>
            <given-names>Joachim</given-names>
            <surname>Denzler</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Detecting scenes in fiction using the embedding delta signal</article-title>
          .
          <source>In Shared Task on Scene Segmentation.</source>
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <given-names>Lilian Diana Awuor</given-names>
            <surname>Wanzare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Roth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Manfred</given-names>
            <surname>Pinkal</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Detecting everyday scenarios in narrative texts</article-title>
          .
          <source>In Proceedings of the Second Workshop on Storytelling</source>
          , pages
          <fpage>90</fpage>
          -
          <lpage>106</lpage>
          , Florence, Italy. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <given-names>Jiacheng</given-names>
            <surname>Xu</surname>
          </string-name>
          , Zhe Gan, Yu Cheng, and Jingjing Liu.
          <year>2019</year>
          .
          <article-title>Discourse-aware neural extractive model for text summarization</article-title>
          .
          <source>arXiv preprint arXiv:1910.14142</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <string-name>
            <given-names>Albin</given-names>
            <surname>Zehe</surname>
          </string-name>
          , Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, Annekea Schreiber, and
          <string-name>
            <given-names>Nathalie</given-names>
            <surname>Wiedmer</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Detecting scenes in fiction: A new segmentation task</article-title>
          .
          <source>In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          , pages
          <fpage>3167</fpage>
          -
          <lpage>3177</lpage>
          , Online. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          <string-name>
            <given-names>Xingxing</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Furu Wei, and
          <string-name>
            <given-names>Ming</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization</article-title>
          .
          <source>In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>5059</fpage>
          -
          <lpage>5069</lpage>
          , Florence, Italy. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>