=Paper=
{{Paper
|id=Vol-3001/paper3
|storemode=property
|title=LTUHH@STSS: Applying Coreference to Literary Scene Segmentation
|pdfUrl=https://ceur-ws.org/Vol-3001/paper3.pdf
|volume=Vol-3001
|authors=Hans Ole Hatzel,Chris Biemann
|dblpUrl=https://dblp.org/rec/conf/konvens/HatzelB21
}}
==LTUHH@STSS: Applying Coreference to Literary Scene Segmentation==
Hans Ole Hatzel (Language Technology Group, Universität Hamburg, Germany) hatzel@informatik.uni-hamburg.de
Chris Biemann (Language Technology Group, Universität Hamburg, Germany) biemann@informatik.uni-hamburg.de
Abstract

In this work, we describe a system for scene segmentation that, relying on character constellations as one of the defining characteristics of scenes, employs a state-of-the-art coreference system. Conceptually building on one of the presented baseline systems, we use a transformer model, enhanced with additional coreference-based features, to identify scene boundaries on the basis of sentence pairs. Finding one of our system's core weaknesses to lie in its local decision making, we adapt an equidistance constraint, avoiding the common error of predicting very short scenes that in many cases cover only a single sentence. We show that coreference is a suitable feature for scene segmentation and experiment with dynamic programming approaches for non-local decisions. This work is a submission to the shared task on scene segmentation (STSS) held at KONVENS 2021, where participants were asked, given annotated training data, to build systems that split novels into scenes: segments narrating a coherent action in one location with the same characters. Our system ranks 4/4 and 4/5 in Track 1 and Track 2, respectively.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

One of the most defining characteristics of scenes is their character constellation; in this work, we describe a scene segmentation system exploiting this characteristic. Other defining aspects of scenes, such as story and discourse time being equal and the fact that they contain a coherent sequence of actions, are not explicitly modeled in this work. The shared task on scene segmentation hosted by Zehe et al. (2021b) provides training data in the form of 22 dime novels, with an additional (for the task duration) unpublished test set and a single trial document. We chose a transformer-based approach as a starting point: we use BERT (Devlin et al., 2019) for scene segmentation, following the general approach of the best baseline proposed by Zehe et al. (2021a). Further, we enrich the BERT-based representation using two sets of features: (a) a coreference-based approach to finding the characters in a given scene and (b) a set of surface features we believe may be helpful. In a second step, we improve our model's results by adding non-local decisions in the form of a cost function optimized using a dynamic programming technique.

2 Related Work

Pethe et al. (2020) approach the task of chapter segmentation, i.e. splitting a document into its chapters. This task is related to scene segmentation in that it operates on a similar domain. As we conjecture, chapter boundaries may also correspond with changes in location or characters, making this work more relevant still. Pethe et al. (2020) take an equidistant approach to chapter segmentation, thereby enhancing local decisions with the knowledge that chapter boundaries tend to be somewhat evenly placed throughout a novel. The equidistant approach is applied by minimizing the following equation:

cost(n, k) = min_{i ∈ [0, n−1]} [ cost(i, k−1) + (1 − α) · |n − i − L| / L − α · s_n ]

where k is the number of breaks to be inserted, n the position at which to insert a break, and L the target length of each segment. α is a hyperparameter controlling the impact of the local boundary score s_n, with values approaching one placing more importance on local decisions.

In our previous work (Schröder et al., 2021), we trained state-of-the-art models for coreference resolution on German data. Following the coarse-to-fine inference architecture for coreference (Lee et al., 2018), we fine-tune transformer models on the German TüBa-D/Z dataset, adapting them to the literature domain using further fine-tuning on
the DROC dataset (Krug et al., 2018). While some of our models enable the handling of arbitrary-length texts, in this work we rely only on the coarse-to-fine model, the application of which, due to its memory requirements, is limited to shorter documents.

3 Model and Features

In order to maximize the contextual information input to BERT, we do not pass an explicit context in conjunction with the two sentences in question (unlike the baseline approach in Zehe et al., 2021a). Instead, our approach follows the Next Sentence Prediction (NSP) training objective in BERT. For each sentence boundary present in the input data, we predict whether the sentences to either side are part of the same scene or whether there is a boundary between them (i.e. we perform a binary classification for the input "[CLS] scene candidate a [SEP] scene candidate b [SEP]"). Note that in the context of the NSP task, "sentence" actually refers to any input sequence and not to a sentence in the linguistic sense. We see this alignment with NSP as a benefit of our system, enabling us to leverage more of BERT's pre-trained capabilities. For this reason, we also chose to use a BERT model rather than an Electra model (Clark et al., 2020), as Electra models are not trained on the NSP objective.

While we did experiment with a BERT model trained on German literary data¹, we did not find success with it, which we attributed to the fact that it is fine-tuned on named entity recognition and may have, in a case of catastrophic forgetting, lost the ability to perform the NSP task. While the coreference-based features rely on previous work of ours (Schröder et al., 2021), for all of the remaining feature extraction we used the "de_core_news_lg" model in spaCy (Honnibal et al., 2020). All features are passed into a linear layer with a GELU activation function (Hendrycks and Gimpel, 2020) in conjunction with the pooled BERT output (i.e. the [CLS] token's embedding). Final predictions are made using individual linear layers for each of the three outputs: binary scene type labels for each of the two sequences and the binary decision of whether there is a scene boundary between them, each with sigmoid activation functions. The model is trained using SGD and a binary cross-entropy loss for each of the three labels, using class weighting based on the training data distribution.

3.1 Coreference Features

Leveraging coreference features, we seek to model one of the central components of scenes: the character constellations. To this end, we pass the number of unique characters appearing in each of the input sequences, together with the number of unique characters appearing in both sequences, to the model.

Taking a more global approach to coreference would also be possible; in this case, the number of characters involved in the current context could be compared to the global number of characters. While this approach may yield further improvements, we did not test it, partly because global coreference resolution for long documents is still much more susceptible to errors than local approaches (Schröder et al., 2021).

3.2 Named Entity Recognition Features

One set of features that we, following manual inspection of the training data, expect to be predictive of scene boundaries are named entities. The explicit mention of characters as well as that of locations should indicate a scene change. We extract the named entity tags for persons, locations, and miscellaneous entities and use document-length-normalized counts of each of them as model inputs. While the coreference features capture some similar information, they capture neither location mentions nor are they able to differentiate between explicit and anaphoric character mentions. Using an NER system trained specifically on literary data could help this step; such data is available in the DROC dataset (Krug et al., 2018).

3.3 Surface Features

In an effort to improve our model, we added a set of surface features that we believed may be indicative of scene changes. We passed the number of tokens (including special characters such as quotes and punctuation) fulfilling the following properties to our model:

• being punctuation
• being uppercased
• being quotation marks
• being a stop word
• being the start of a sentence

¹ https://huggingface.co/severinsimmler/literary-german-bert
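The coreference features of Section 3.1 and the surface features of Section 3.3 amount to simple counts over the two input sequences. A minimal sketch of this feature extraction, assuming characters have already been resolved to identifiers by a coreference system and tokens are given as plain strings (the paper's actual pipeline uses its own coreference models and spaCy, which are not reproduced here; the function name and argument layout are illustrative):

```python
import string

def pair_features(chars_a, chars_b, tokens_a, tokens_b, stop_words):
    """Count-based features for one sentence pair (simplified sketch)."""
    # Coreference features: unique characters per sequence and their overlap.
    unique_a, unique_b = set(chars_a), set(chars_b)
    features = [len(unique_a), len(unique_b), len(unique_a & unique_b)]
    # Surface features: number of tokens fulfilling each property, per sequence.
    # (Sentence-start counts are omitted here; they require sentence segmentation.)
    for tokens in (tokens_a, tokens_b):
        features += [
            sum(t in string.punctuation for t in tokens),    # being punctuation
            sum(t[:1].isupper() for t in tokens),            # being uppercased
            sum(t in ('"', '“', '”', '„') for t in tokens),  # being quotation marks
            sum(t.lower() in stop_words for t in tokens),    # being a stop word
        ]
    return features
```

The resulting vector would be concatenated with the pooled BERT output before the final linear layers.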
While all these features could, in principle, be picked up by means of representation learning in our neural model, we still add them explicitly due to the relatively small number of training samples.

4 Intermediate Results

While, in principle, our model is capable of predicting both scene boundaries and scene types, our final system uses two distinct models with the same architecture and inputs for the two tasks. Joint training presents non-trivial challenges in balancing the two target objectives but may yield improvements in final results. Both models were trained with early stopping on the trial data (i.e. the one document provided with the task description but not part of the training data); a hyperparameter search for individual learning rates for the final layers (between 1 × 10⁻³ and 1 × 10⁻⁵) and the BERT model (between 1 × 10⁻⁴ and 2 × 10⁻⁵) was performed using the Tree-structured Parzen Estimator (Bergstra et al., 2011) implementation by Akiba et al. (2019). The final model for scene types stopped after 5000 steps (returning to the set of weights from step 2000) with batch size 24 and an evaluation frequency of 1000 steps, and used a learning rate of 9.9 × 10⁻⁵ for BERT and 6.4 × 10⁻⁴ for the final layers. The final model for scene boundaries stopped after 18 000 steps (returning to the set of weights from step 15 000) with batch size 24 and an evaluation frequency of 1000 steps, and used a learning rate of 4.8 × 10⁻⁵ for BERT and 2.84 × 10⁻⁵ for the final layers.

Using the features described so far, we reach an F1-score of 33.7 on the task's trial document², presumably already outperforming the baseline system. Figure 1 illustrates the predicted boundaries together with the network's output values for each of the potential scene splits, i.e. each pair of sentences. Notably, there are multiple cases of two or more directly adjacent false positives, sometimes, as at the very end of the document, in conjunction with a true positive boundary. This illustrates what we see as a key weakness of our initial model: since decisions are purely local, when in doubt about the placement, the model creates multiple boundaries where one would be sufficient.

5 Non-Local Model

As discussed in Section 4, we see an issue in the local nature of our scene boundary decisions. One approach to remedy this may be training on sequences of adjacent sentence pairs; this would have the advantage of allowing for non-local decisions informed by any part of the neighboring inputs. At the same time, however, this increases the memory requirements, and with scene boundaries occurring about every 43 sentences on average, a large enough context may (depending on available GPU memory) be infeasible to train jointly. Our early approaches instead focused on applying neural sequence models to the local decision outputs, but with this approach we did not manage to improve upon local-decision-based results.

Instead, we chose a purely algorithmic approach without training: the dynamic programming (DP) approach by Pethe et al. (2020), a technique that requires prior knowledge of the number of chapter, or in our case scene, boundaries. Applying their approach to the task's held-out trial document, given the correct number of scene boundaries (and with α = 0.9), results in an F1-score of 39.1. This represents an improvement of around 5.4 over the local F1-score of 33.7. For comparison, when only using the k highest confidence values, where k is the number of gold boundaries, we only reach an F1-score of 34.8, illustrating that the mere knowledge of the number of scenes is not as impactful.

Figure 2 shows the effect the cost function can have on decisions; while α = 0.7 actually entails a worse F1-score, the effect is very subtle when using larger α values (i.e. when incorporating local decisions to a larger extent).

Figure 3 illustrates that the coefficient of variation (CV) of scene lengths in the shared task's data is much higher than it is for the chapter data in the work by Pethe et al. (2020), where the distribution is centered around a value below 0.5. This can be interpreted as the length of chapters inside most documents being less variable than the length of scenes in many documents of our dataset, although it is to be noted that the two statistics are computed on the basis of very different datasets. The standard deviation of the distribution of average per-document scene lengths (in sentences) is 10.84 with a mean of 45.3 and, accordingly, a CV of 0.24.

Another very simple approach to using non-local information is to, within a fixed window, only consider the top value to actually constitute a boundary.

² Unless otherwise specified, F1-score refers to the boundary class's F1-score throughout this document.
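This fixed-window heuristic can be sketched as a simple non-maximum suppression over the per-position confidences (the function name and tie handling here are illustrative choices, not the paper's exact implementation):

```python
def suppress_non_maxima(confidences, window=5):
    """Zero out every boundary confidence that is not the maximum
    within a centered window of the given size."""
    half = window // 2
    kept = []
    for i, c in enumerate(confidences):
        lo, hi = max(0, i - half), min(len(confidences), i + half + 1)
        # Keep the candidate only if nothing in its window beats it.
        kept.append(c if c == max(confidences[lo:hi]) else 0.0)
    return kept
```

With a window size of five, a candidate survives only if none of its four neighbors (two to either side) has a higher confidence.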
[Figure: model confidence (0.00–1.00) over sentence number (0–1000), with true positives, false positives, false negatives, and model predictions marked]
Figure 1: Positions of scene splits in the trial data using only local decisions
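The local confidences plotted in Figure 1 come from classifying adjacent sentence pairs in the NSP-style input format of Section 3. Building those pair inputs from a sentence list might look like the following sketch (the real system feeds tokenized IDs to a fine-tuned BERT model rather than raw strings; the helper name is illustrative):

```python
def build_pair_inputs(sentences, boundary_after):
    """For each adjacent sentence pair, build an NSP-style input string and a
    binary label: 1 if a scene boundary lies between the two sentences."""
    inputs, labels = [], []
    for i in range(len(sentences) - 1):
        inputs.append(f"[CLS] {sentences[i]} [SEP] {sentences[i + 1]} [SEP]")
        labels.append(1 if i in boundary_after else 0)
    return inputs, labels
```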
[Figure: model confidence (0.00–1.00) over sentence number (0–1000), with true positives, false positives, false negatives, and model predictions marked]
Figure 2: Positions of scene splits using the DP technique with α = 0.7
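The boundary placement shown in Figure 2 follows the DP recursion of Pethe et al. (2020) given in Section 2. A self-contained sketch, under two simplifying assumptions not taken from the paper (the document start is treated as a virtual break, and the final partial segment is left unpenalized):

```python
import math

def dp_segment(scores, k, L, alpha=0.9):
    """Place k breaks over len(scores) candidate positions, minimizing
    cost(n, j) = min_i cost(i, j-1) + (1 - alpha) * |n - i - L| / L - alpha * s_n,
    where scores[p - 1] is the local boundary confidence for position p."""
    N = len(scores)
    INF = math.inf
    # cost[j][p]: best cost with j breaks placed, the last one at position p.
    cost = [[INF] * (N + 1) for _ in range(k + 1)]
    back = [[0] * (N + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0  # virtual break at the document start
    for j in range(1, k + 1):
        for p in range(j, N + 1):
            for i in range(p):  # position of the previous break
                if cost[j - 1][i] == INF:
                    continue
                c = cost[j - 1][i] + (1 - alpha) * abs(p - i - L) / L - alpha * scores[p - 1]
                if c < cost[j][p]:
                    cost[j][p], back[j][p] = c, i
    # Choose the best position for the final break, then backtrack.
    end = min(range(1, N + 1), key=lambda p: cost[k][p])
    breaks, p = [], end
    for j in range(k, 0, -1):
        breaks.append(p)
        p = back[j][p]
    return breaks[::-1]
```

The recursion is O(k · N²); for a single novel this is cheap, since N is the number of sentence gaps.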
[Figure: histogram of the number of documents (0–4) over the coefficient of variation (0.5–1.2)]
Figure 3: The coefficient of variation in scene lengths for each individual document in the training data.

For this, we walk across the boundary candidates and, in a fixed-size window, set the boundary class to zero for all but the largest value in the window. With a window size of five, for example, this means that no candidate with larger confidence values among its four neighbors (two to either side) will be predicted. Using this simple strategy, however, we adversely impact the quality of our predictions, going from an F1-score of 33.7 to one of 27.8.

The improvements attained by application of the DP technique by Pethe et al. (2020), in combination with the variance of 0.74 in the task's trial document, illustrate just how important non-local information is to improving performance on this task. Further work on neural sequence models may yield significant improvements.

Our final model uses the DP approach by Pethe et al. (2020) with α = 0.8, a strong focus on local values. As explicitly stated in their paper, this method assumes knowledge of the actual number of boundaries, which is not available for our data. We apply the heuristic of assuming the number of actual boundaries to be equal to the number of locally predicted boundaries. This way, our non-local approach effectively only moves the positions at which splits happen but does not change their total number. Unsurprisingly, given the variance in scene lengths, we found this to outperform the heuristic of dividing the text length by the average scene length. Further, we adapt the cost function to be more lenient with regard to scenes shorter than the average, as long as they are not too short.

Figure 4 shows how we adapt the equidistant constraint by Pethe et al. (2020) to punish very short distances. Where their cost function is linear in both directions, we adapt it to only punish very short scenes harshly.
[Figure: cost (0–1) over deviation from the target distance, from −L to L; the equidistant cost of Pethe et al. (2020) is plotted against our adapted cost max(−log(x + 1) · (1/β), x²)]
Figure 4: The cost associated with deviation from the target distance L, where a deviation of −L is equivalent to a boundary distance of zero.

−log(x + 1) · (1/β)    (1)

For this, we apply the cost function in Equation 1 to negative distances relative to the target distance L; β is a hyperparameter controlling how close to a distance of zero very large costs set in, and we use β = 2. For positive distances, we use x², effectively increasing the inherent α but also changing the relation of long distances to short ones.

Evaluating the same technique on our training data yielded a marginal improvement of around 0.01 F1; this is to be expected, as some memorization of training samples should lead to improved local decisions. This result does give us confidence that the approach will not adversely impact test set performance.

While, after optimizing α on the held-out data, the equidistant cost function performed on par with our cost function on the same data, when adapting to the training data (on which our α value was not optimized) the equidistant function only increased performance by 0.003 F1. Further analysis is needed to provide a clear picture of the cost functions' impact on unseen data. It already seems plausible, however, that our adaptation of the cost function presents an improvement over the equidistant cost function.

6 Conclusion and Final Results

We present an approach to scene segmentation that relies on character information. While we do not produce irrefutable evidence of its advantages, we propose a cost function more suitable to the needs of scene segmentation, adapting the work by Pethe et al. (2020) to a new task.

On the official evaluation metric, we only reach an F1-score of 0.02 for Track 1 and an F1-score of 0.11 for Track 2. These are below the boundary class performance discussed earlier, as they include the correct classification of scene types. With our system focusing mostly on the placement of scene boundaries, it could potentially be extended with features more suitable for scene classification.

The system performs relatively poorly in Track 1, reaching the last place with quite a margin to the next system, but much better in Track 2, where it is close behind the third-placed system; what exactly causes this difference in performance remains unclear. We stay far behind the performance of the top-scoring systems, but coreference seems to be a salient feature that may be useful to include in future systems.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2623–2631, Anchorage, Alaska, USA. Association for Computing Machinery.

James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, pages 469–477, Granada, Spain. Curran Associates, Inc.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, Online.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2020. Gaussian error linear units (GELUs). Computing Research Repository, arXiv:1606.08415. Version 4.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://github.com/explosion/spaCy/tree/v3.1.1.
Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, and Fotis Jannidis. 2018. Description of a corpus of character references in German novels – DROC [Deutsches ROman Corpus]. DARIAH-DE Working Papers, 27:1–16.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687–692, New Orleans, Louisiana, USA. Association for Computational Linguistics.

Charuta Pethe, Allen Kim, and Steve Skiena. 2020. Chapter Captor: Text Segmentation in Novels. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8373–8383, Online. Association for Computational Linguistics.

Fynn Schröder, Hans Ole Hatzel, and Chris Biemann. 2021. Neural end-to-end coreference resolution for German in different domains. In Proceedings of the 17th Conference on Natural Language Processing, Düsseldorf, Germany.

Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, Annekea Schreiber, and Nathalie Wiedmer. 2021a. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3167–3177, Online. Association for Computational Linguistics.

Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, and Annekea Schreiber. 2021b. Shared task on scene segmentation@KONVENS 2021. In Shared Task on Scene Segmentation.