    LTUHH@STSS: Applying Coreference to Literary Scene Segmentation


Hans Ole Hatzel, Language Technology Group, Universität Hamburg, Germany, hatzel@informatik.uni-hamburg.de
Chris Biemann, Language Technology Group, Universität Hamburg, Germany, biemann@informatik.uni-hamburg.de




Abstract

In this work, we describe a system for scene segmentation that, relying on character constellations as one of the defining characteristics of scenes, employs a state-of-the-art coreference system. Conceptually building on one of the presented baseline systems, we use a transformer model, enhanced with additional coreference-based features, to identify scene boundaries on the basis of sentence pairs. Finding one of our system's core weaknesses to lie in its local decision making, we adapt an equidistance constraint, avoiding the common error of predicting very short scenes that in many cases only cover a single sentence. We show that coreference is a suitable feature for scene segmentation and experiment with dynamic programming approaches for non-local decisions. This work is a submission for the shared task on scene segmentation (STSS) held at KONVENS 2021, where task participants were asked to, given annotated training data, build systems that split novels into scenes: segments narrating a coherent action in one location with the same characters. Our system ranks 4/4 and 4/5 in Track 1 and Track 2, respectively.

1 Introduction

One of the most defining characteristics of scenes are character constellations; in this work, we describe a scene segmentation system exploiting this characteristic. Other defining aspects of scenes, such as the story and discourse time being equal and the fact that they contain a coherent sequence of actions, will not be explicitly modeled in this work. The shared task on scene segmentation hosted by Zehe et al. (2021b) provides training data in the form of 22 dime novels, with an additional (for the task duration) unpublished test set and a single trial document. We chose a transformer-based approach as a starting point: we use BERT (Devlin et al., 2019) for scene segmentation, following the general approach of the best baseline proposed by Zehe et al. (2021a). Further, we enrich the BERT-based representation using two sets of features: (a) a coreference-based approach to finding the characters in a given scene and (b) a set of surface features we believe may be helpful. In a second step, we improve our model's results by adding non-local decisions in the form of a cost function optimized using a dynamic programming technique.

2 Related Work

Pethe et al. (2020) approach the task of chapter segmentation, the task of splitting a document into its chapters. This task is related to scene segmentation in that it operates on a similar domain. As we conjecture, chapter boundaries may also correspond with changes in location or characters, making this work more relevant still. Pethe et al. (2020) take an equidistant approach to chapter segmentation, thereby enhancing local decisions with the knowledge that chapter boundaries tend to be somewhat evenly placed throughout a novel. The equidistant approach is applied by minimizing the following equation:

    cost(n, k) = min_{i ∈ [0, n−1]} ( cost(i, k−1) + (1 − α) · |n − i − L| / L − α · s_n )

where k is the number of breaks to be inserted, n the position at which to insert a break, and L the target length of each segment. α is a hyperparameter controlling the impact of the local boundary score s_n, with values approaching one placing more importance on local decisions.

In our previous work (Schröder et al., 2021), we trained state-of-the-art models for coreference resolution on German data. Following the coarse-to-fine inference architecture for coreference (Lee et al., 2018), we fine-tune transformer models on the German TüBa-D/Z dataset, adapting them to the literature domain using further fine-tuning on

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
the DROC dataset (Krug et al., 2018). While some of our models enable the handling of arbitrary-length texts, in this work we only rely on the coarse-to-fine model, the application of which, due to its memory requirement characteristics, is limited to shorter documents.

3 Model and Features

In order to maximize the contextual information input to BERT, we do not pass an explicit context in conjunction with the two sentences in question (unlike the baseline approach in Zehe et al., 2021a). Instead, our approach follows the Next Sentence Prediction (NSP) training objective in BERT. For each sentence boundary present in the input data, we predict if the sentences to either side are part of the same scene or if there is a boundary between them (i.e. we perform a binary classification for the input "[CLS] scene candidate a [SEP] scene candidate b [SEP]"). Note that in the context of the NSP task, "sentence" actually refers to any input sequence and not a sentence in the linguistic sense. We see this alignment with NSP as a benefit of our system, enabling us to leverage more of BERT's pre-trained capabilities. For this reason, we also chose to use a BERT model rather than an Electra model (Clark et al., 2020), as Electra models are not trained on the NSP objective.

While we did experiment with a BERT model trained on German literary data¹, we did not find success with it, which we attributed to the fact that it is fine-tuned on named entity recognition and may have, in a case of catastrophic forgetting, lost the ability to perform the NSP task. While the coreference-based features rely on previous work of ours (Schröder et al., 2021), for all of the remaining feature extraction we used the "de_core_news_lg" model in spaCy (Honnibal et al., 2020). All features are passed into a linear layer with a GELU activation function (Hendrycks and Gimpel, 2020) in conjunction with the pooled BERT output (i.e. the [CLS] token's embedding). Final predictions are made using individual linear layers for each of the three outputs: binary scene type labels for each of the two sequences and the binary decision of whether there is a scene boundary between them, each with sigmoid activation functions. The model is trained using SGD and binary cross-entropy loss for each of the three labels, using class weighting based on the training data distribution.

¹ https://huggingface.co/severinsimmler/literary-german-bert

3.1 Coreference Features

Leveraging coreference features, we seek to model one of the central components of scenes: the character constellations. To this end, we pass the number of unique characters appearing in each of the input sequences, together with the number of unique characters appearing in both sequences, to the model.

Taking a more global approach to coreference would also be possible; in this case, the number of characters involved in the current context could be compared to the global number of characters. While this approach may yield further improvements, we did not test it, partly due to the fact that global coreference resolution for long documents is still much more susceptible to errors than local approaches (Schröder et al., 2021).

3.2 Named Entity Recognition Features

One set of features that we, following manual inspection of the training data, expect to be predictive of scene boundaries is named entities. The explicit mention of characters, as well as that of locations, should indicate a scene change. We extract the named entity tags for persons, locations, and miscellaneous entities and use document-length-normalized counts of each of them as a model input. While the coreference features capture some similar information, they capture neither location mentions nor are they able to differentiate between explicit and anaphoric character mentions.

Using a NER system trained specifically on literary data could help this step; such data is available in the DROC dataset (Krug et al., 2018).

3.3 Surface Features

In an effort to improve our model, we added a set of surface features that we believed may be indicative of scene changes. We passed the number of tokens (including special characters such as quotes and punctuation) fulfilling each of the following properties to our model:

• being punctuation
• being uppercased
• being quotation marks
• being a stop word
• being the start of a sentence
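As a minimal sketch, these token-level counts can be computed from boolean token flags like the ones spaCy exposes (is_punct, is_stop, is_sent_start). The Token class below is a simplified stand-in for spaCy tokens, and all names are illustrative rather than taken from our actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Simplified stand-in for a spaCy token; in the real pipeline
    these flags would come from the "de_core_news_lg" analysis."""
    text: str
    is_punct: bool = False
    is_stop: bool = False
    is_sent_start: bool = False


# A small, non-exhaustive set of quote characters for illustration.
QUOTE_CHARS = {'"', "'", "„", "“", "”", "»", "«"}


def surface_features(tokens):
    """Count the tokens fulfilling each surface property."""
    return {
        "punctuation": sum(t.is_punct for t in tokens),
        "uppercased": sum(t.text.isupper() for t in tokens),
        "quotation_marks": sum(t.text in QUOTE_CHARS for t in tokens),
        "stop_words": sum(t.is_stop for t in tokens),
        "sentence_starts": sum(t.is_sent_start for t in tokens),
    }
```

The resulting counts could then be normalized and concatenated with the coreference and NER features before the final linear layers.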

While all these features could, in principle, be picked up by means of representation learning in our neural model, we still add them due to the relatively small number of training samples.

4 Intermediate Results

While, in principle, our model is capable of predicting both scene boundaries and scene types, our final system uses two distinct models with the same architecture and inputs for the two tasks. Joint training presents non-trivial challenges in balancing the two target objectives but may yield improvements in final results. Both models were trained with early stopping on the trial data (i.e. one document provided with the task description but not as part of the training data); a hyperparameter search for individual learning rates for the final layers (between 1 × 10⁻³ and 1 × 10⁻⁵) and the BERT model (between 1 × 10⁻⁴ and 2 × 10⁻⁵) was performed using the Tree-structured Parzen Estimator (Bergstra et al., 2011) implementation by Akiba et al. (2019). The final model for scene types stopped after 5000 steps of batch size 24 (returning to the set of weights from step 2000, with an evaluation frequency of 1000 steps) and used a learning rate of 9.9 × 10⁻⁵ for BERT and 6.4 × 10⁻⁴ for the final layers. The final model for scene boundaries stopped after 18 000 steps of batch size 24 (returning to the set of weights from step 15 000, with an evaluation frequency of 1000 steps) and used a learning rate of 4.8 × 10⁻⁵ for BERT and 2.84 × 10⁻⁵ for the final layers.

Using the features described so far, we reach an F1-score of 33.7 on the task's trial document², presumably already outperforming the baseline system. Figure 1 illustrates the predicted boundaries together with the network's output values for each of the potential scene splits, i.e. each pair of sentences. Notably, there are multiple cases of two or more directly adjacent instances of false positives, sometimes, as at the very end of the document, in conjunction with a true positive boundary. This illustrates what we see as a key weakness of our initial model: since decisions are purely local, when in doubt about the placement, the model creates multiple boundaries where one would be sufficient.

² Unless otherwise specified, F1-score refers to the boundary class's F1-score throughout this document.

5 Non-Local Model

As discussed in Section 4, we see an issue in the local nature of scene segmentation boundaries. One approach to remedy this may be training on sequences of adjacent sentence pairs; this would have the advantage of allowing for non-local decisions, informed by any part of neighboring inputs. At the same time, however, this increases the memory requirements, and with scene boundaries occurring about every 43 sentences on average, a large enough context may (depending on available GPU memory) be infeasible to jointly train. Our early approaches instead focused on using neural sequence models on local decision outputs, but with this approach we did not manage to improve upon local-decision-based results.

Instead, we chose a purely algorithmic approach without training: the dynamic programming (DP) approach by Pethe et al. (2020), a technique that requires prior knowledge of the number of chapter, or in our case scene, boundaries. Applying their approach to the task's held-out trial document, given the correct number of scene boundaries (with α = 0.9), results in an F1-score of 39.1. This represents an improvement of around 5.4 over the local F1-score of 33.7. For comparison, when only using the k highest confidence values, where k is the number of gold boundaries, we only get an F1-score of 34.8, illustrating that the mere knowledge of the number of scenes is not as impactful. Figure 2 shows the effect the cost function can have on decisions; while α = 0.7 actually entails a worse F1-score, the effect is very subtle when using larger α values (i.e. when incorporating local decisions to a larger extent).

Figure 3 illustrates that the coefficient of variation (CV) of scene lengths in the shared task's data is much higher than it is for the chapter data in the work by Pethe et al. (2020), where the distribution is centered around a value below 0.5. This can be interpreted as the length of chapters inside most documents being less variable than the length of scenes in many documents in our dataset, although it is to be noted that the two statistics are computed on the basis of very different datasets. The standard deviation of the distribution of average per-document scene lengths (in sentences) is 10.84 with a mean of 45.3 and, accordingly, a CV of 0.24.

Another very simple approach to using non-local information is to, in a fixed window, only consider the top value to actually constitute a boundary.
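The two non-local strategies discussed in this section can be sketched as follows: a small dynamic program following the recurrence by Pethe et al. (2020) as given in Section 2, and the fixed-window suppression heuristic. This is a sketch under our reading of the recurrence, not the reference implementation; function names and the toy inputs are purely illustrative.

```python
def dp_segment(scores, k, L, alpha=0.9):
    """Place k boundaries given per-position boundary scores.

    Sketch of the equidistant DP (after Pethe et al., 2020): each
    segment pays (1 - alpha) * |length - L| / L for deviating from
    the target length L and earns alpha * score for the boundary
    position it ends on.
    """
    n_pos = len(scores)  # candidate boundary positions 1..n_pos
    INF = float("inf")
    # cost[j][n]: best cost with j boundaries placed, the last at position n
    cost = [[INF] * (n_pos + 1) for _ in range(k + 1)]
    back = [[0] * (n_pos + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0  # virtual boundary before the first sentence
    for j in range(1, k + 1):
        for n in range(1, n_pos + 1):
            for i in range(n):
                c = (cost[j - 1][i]
                     + (1 - alpha) * abs(n - i - L) / L
                     - alpha * scores[n - 1])
                if c < cost[j][n]:
                    cost[j][n], back[j][n] = c, i
    # pick the best final boundary and walk the backpointers
    n = min(range(n_pos + 1), key=lambda m: cost[k][m])
    boundaries = []
    for j in range(k, 0, -1):
        boundaries.append(n)
        n = back[j][n]
    return boundaries[::-1]


def window_suppress(scores, preds, window=5):
    """Keep a predicted boundary only if its confidence is the maximum
    within the surrounding window (two neighbours to either side for
    window=5)."""
    half = window // 2
    kept = []
    for n, predicted in enumerate(preds):
        lo, hi = max(0, n - half), min(len(scores), n + half + 1)
        if predicted and scores[n] == max(scores[lo:hi]):
            kept.append(n)
    return kept
```

As reported in this section, the window heuristic alone hurt our F1-score, whereas the DP search improved it.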

For this, we walk across the boundary candidates and, in a fixed-size window, set the boundary class to zero for all but the largest value in the window. With a window size of five, for example, this means that no candidate with larger confidence values among its four neighbors (two to either side) will be predicted. Using this simple strategy, however, we adversely impact the quality of our predictions, going from an F1-score of 33.7 to one of 27.8.

Figure 1: Positions of scene splits in the trial data using only local decisions (model confidence over sentence number, marking true positives, false positives, and false negatives).

Figure 2: Positions of scene splits using the DP technique with α = 0.7.

Figure 3: The coefficient of variation in scene lengths for each individual document in the training data.

The improvements attained by application of the DP technique by Pethe et al. (2020), in combination with the variance of 0.74 in the task's trial document, illustrate just how important non-local information is to improving performance in this task. Further work on neural sequence models may yield significant improvements.

Our final model uses the DP approach by Pethe et al. (2020) with α = 0.8, a strong focus on local values. As explicitly stated in their paper, this method assumes knowledge of the actual number of boundaries, which is not the case for our data. We apply the heuristic of assuming the number of actual boundaries to be equal to the number of locally predicted boundaries. This way, our non-local approach effectively only moves the positions at which splits happen but does not change their total number. Unsurprisingly, given the variance in scene lengths, we found this to outperform the heuristic of dividing the text length by the average scene length. Further, we adapt the cost function to be more lenient with regard to scenes shorter than the average, as long as they are not too short.

Figure 4 shows how we adapt the equidistant constraint by Pethe et al. (2020) to punish very short distances. Where their cost function is linear in both directions, we adapt it to only punish very

short scenes harshly.

    −log(x + 1) · (1/β)    (1)

For this, we apply the cost function in Equation 1 to negative distances relative to the target distance L; β is a hyperparameter controlling how close to a distance of zero very large costs set in, and we use β = 2. For positive distances, we use x², effectively increasing the inherent α but also changing the relation of long distances to short ones.

Figure 4: The cost associated with deviation from the target distance L, where a deviation of −L is equivalent to a boundary distance of zero. Shown are the equidistant cost of Pethe et al. (2020) and our adapted cost max(−log(x + 1) · (1/β), x²).

Evaluating the same technique on our training data yielded a marginal improvement of around 0.01 F1; this is to be expected, as some memorization of training samples should lead to improved local decisions. This result does give us confidence that the approach will not adversely impact test set performance.

While, after optimizing α on the held-out data, the equidistant cost function performed on par with our cost function on the same data, when adapting to the training data (on which our α value was not optimized) the equidistant function only increased performance by 0.003 F1. Further analysis is needed to provide a clear picture of the cost function's impact on unseen data. It however already seems plausible that our adaptation of the cost function presents an improvement over the equidistant cost function.

6 Conclusion and Final Results

We present an approach to scene segmentation that relies on character information. While we do not produce irrefutable evidence of its advantages, we propose a cost function more suitable to the needs of scene segmentation, adapting the work by Pethe et al. (2020) to a new task.

On the official evaluation metric, we only reach an F1-score of 0.02 for Track 1 and an F1-score of 0.11 for Track 2. These are below the boundary class performance discussed earlier, as they include the correct classification of scene types. With our system focusing mostly on the placement of scene boundaries, it could potentially be extended with features more suitable for scene classification.

The system performs relatively poorly in Track 1, reaching the last place with quite a margin to the next system, but much better in Track 2, where it is close behind the third-placed system; what exactly causes this difference in performance remains unclear. We stay far behind the performance of the top-scoring systems, but coreference seems to be a salient feature that may be useful to include in future systems.

References

Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2623–2631, Anchorage, Alaska, USA. Association for Computing Machinery.

James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. 2011. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems, volume 24, pages 469–477, Granada, Spain. Curran Associates, Inc.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In International Conference on Learning Representations, Online.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, USA. Association for Computational Linguistics.

Dan Hendrycks and Kevin Gimpel. 2020. Gaussian error linear units (GELUs). Computing Research Repository, arXiv:1606.08415. Version 4.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. https://github.com/explosion/spaCy/tree/v3.1.1.

Markus Krug, Lukas Weimer, Isabella Reger, Luisa
 Macharowsky, Stephan Feldhaus, Frank Puppe, and
 Fotis Jannidis. 2018. Description of a corpus of char-
 acter references in German novels-DROC [Deutsches
 ROman Corpus]. DARIAH-DE Working Papers,
 27:1–16.
Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018.
  Higher-order coreference resolution with coarse-to-
  fine inference. In Proceedings of the 2018 Confer-
  ence of the North American Chapter of the Associ-
  ation for Computational Linguistics: Human Lan-
  guage Technologies, Volume 2 (Short Papers), pages
  687–692, New Orleans, Louisiana, USA. Association
  for Computational Linguistics.

Charuta Pethe, Allen Kim, and Steve Skiena. 2020.
  Chapter Captor: Text Segmentation in Novels. In
  Proceedings of the 2020 Conference on Empirical
  Methods in Natural Language Processing (EMNLP),
  pages 8373–8383, Online. Association for Computa-
  tional Linguistics.

Fynn Schröder, Hans Ole Hatzel, and Chris Biemann.
  2021. Neural end-to-end coreference resolution for
  German in different domains. In Proceedings of the
  17th Conference on Natural Language Processing,
  Düsseldorf, Germany.
Albin Zehe, Leonard Konle, Lea Katharina
  Dümpelmann, Evelyn Gius, Andreas Hotho,
  Fotis Jannidis, Lucas Kaufmann, Markus Krug,
  Frank Puppe, Nils Reiter, Annekea Schreiber, and
  Nathalie Wiedmer. 2021a. Detecting scenes in
  fiction: A new segmentation task. In Proceedings
  of the 16th Conference of the European Chapter of
  the Association for Computational Linguistics: Main
  Volume, pages 3167–3177, Online. Association for
  Computational Linguistics.
Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katha-
  rina Dümpelmann, Evelyn Gius, Andreas Hotho, Fo-
  tis Jannidis, Lucas Kaufmann, Markus Krug, Frank
  Puppe, Nils Reiter, and Annekea Schreiber. 2021b.
  Shared task on scene segmentation@konvens2021.
  In Shared Task on Scene Segmentation.



