=Paper=
{{Paper
|id=Vol-3001/paper5
|storemode=property
|title=Twin BERT Contextualized Sentence Embedding Space Learning and Gradient-Boosted Decision Tree Ensembles for Scene Segmentation in German Literature
|pdfUrl=https://ceur-ws.org/Vol-3001/paper5.pdf
|volume=Vol-3001
|authors=Sebastian Gombert
|dblpUrl=https://dblp.org/rec/conf/konvens/Gombert21
}}
==Twin BERT Contextualized Sentence Embedding Space Learning and Gradient-Boosted Decision Tree Ensembles for Scene Segmentation in German Literature==
Sebastian Gombert
Information Center for Education
DIPF: Leibniz Institute for Research and Information in Education
Frankfurt am Main, Germany
gombert@dipf.de
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

This paper documents a submission to the shared task on scene segmentation hosted at KONVENS 2021 (Zehe et al., 2021b). The aim of this shared task was to find methods for segmenting narrative texts into different scenes – segments of text in which location, time and the constellation of characters stay more or less coherent. The task is formulated as a sentence classification task in which sentences bordering scenes have to be distinguished from in-scene sentences. The approach presented in this paper is based on two steps. In the first, a twin BERT training setup is used to learn a sentence embedding space in which sentences functioning as scene borders are well-separated from in-scene ones. In the second, the sentence embeddings generated by this model are used as feature vectors to feed a gradient-boosted decision tree ensemble which conducts the final predictions. On the shared task leaderboard, the system ranked second in track 1 and first in track 2.

1 Introduction

Scene segmentation in narrative texts is a novel task in natural language processing introduced by Zehe et al. (2021a). The aim of this task is to segment pieces of literature into scenes – sections of text in which the relation of story time and discourse time, the location and the character constellations stay more or less the same. From a formal point of view, this problem can be interpreted as a sentence-in-context classification task in which sentences separating scenes have to be distinguished from in-scene ones. Such segmentation is needed because the typical length of longer narrative texts such as novels prevents techniques useful for subsequent steps of analysis, such as co-reference resolution, from functioning well (Zehe et al., 2021a). Once a text is segmented into coherent scenes, each scene can be processed separately, improving the performance of such follow-up processing.

This paper presents a system participating in the KONVENS 2021 shared task on scene segmentation (Zehe et al., 2021b) which relies on two steps. In the first, a BERT-based (Devlin et al., 2019) neural network trained in a twin network setup is used to predict embeddings for input sentences (Reimers and Gurevych, 2019). This network was trained to provide an embedding space in which sentences bordering scenes are well-separated from in-scene ones. In the second step, gradient-boosted decision tree ensembles (Mason et al., 1999) are fed these sentence embeddings as feature vectors to carry out the final predictions.

For the shared task evaluations, this system was trained on a data set consisting of various German dime novels in which scene borders had been annotated. Participating systems were evaluated in two tracks using F1 scores. In the first track, the models were evaluated on a test set consisting of additional dime novels. In this track, the system presented in this paper achieved second place with an F1 of 0.16. In the second track, domain adaptability was probed by evaluating the systems on a set of contemporary German highbrow literature. Here, the system performed better and ranked first with an F1 of 0.26.

2 Background

2.1 Task Description

In Zehe et al. (2021a), the authors interpreted the task of scene segmentation as a sentence classification task. They defined four different classes of sentences: no border, scene-to-scene, scene-to-nonscene and nonscene-to-scene. The three latter classes are used to mark the different kinds of textual borders among the sentences. They trained a BERT-based (Devlin et al., 2019) classifier which utilises a sliding window over multiple sentences for context encoding to carry out sentence classification.

This approach was evaluated against the unsupervised TextTiling (Hearst, 1997) and TopicTiling (Riedl and Biemann, 2012) methods on a corpus consisting of 15 German dime novels using cross-validation. While the supervised BERT model achieved superior results (γ = 0.15) compared to the unsupervised methods (γ = 0.01; γ = 0.02), the overall results turned out subpar, which led the authors to conclude that scene segmentation can be regarded as an inherently hard task.

For the KONVENS 2021 shared task, the organizers provided an expanded version of the data set presented by Zehe et al. (2021a). This data set is composed of various German dime novels. The authors chose this genre as they deemed it easier for potential models to deal with.

2.2 Related Work

While segmenting text into smaller units such as tokens, sentences or spans is one of the oldest and most researched topics in natural language processing, the task of semantically segmenting narrative texts into scenes is a new one. In this form, scene segmentation was first introduced by Zehe et al. (2021a). From a problem-centric point of view, Zehe et al. (2021a) relate scene segmentation to topic segmentation, the task of segmenting a text by topic changes, as changes of time, place and character constellation can be interpreted as special cases of topic changes.

Most of the more recent work in this area (Riedl and Biemann, 2012; Misra et al., 2011) is built upon latent Dirichlet allocation (Blei et al., 2003). This method discovers fields of words consistently co-occurring in the same contexts. By monitoring changes in their distribution throughout a text, one can define topic-wise section borders. Another related topic according to Zehe et al. (2021a) is discourse coherence. Recent approaches in this area rely on neural networks to detect textual coherence in various setups and use cases (Li and Jurafsky, 2017; Pichotta and Mooney, 2016). Changes in these coherence scores can be used for detecting borders within texts as well.

3 System Description

My code can be found at https://github.com/SGombert/ssts-2021-sego.

3.1 Adjustments to the Tag Set

While Zehe et al. (2021a) used a quaternary tag set distinguishing scene-to-scene and nonscene-to-scene borders, which is also used for the official shared task evaluations, my system internally relies on a tertiary tag set consisting of the tags O, SCENE and NONSCENE, where the latter two mark the first sentence of a corresponding section. The reason for this adjustment is that the number of border sentences is low compared to the number of non-border sentences, and the tertiary tag set is the smallest classification setup which can still distinguish scenes from non-scenes. Using it results in all scene-to-scene and nonscene-to-scene sentences being grouped under the SCENE tag, and all scene-to-nonscene ones under NONSCENE.

3.2 Twin BERT Embedding Space Learning

My system is built around the idea of neural embedding space learning. Reimers and Gurevych (2019) introduced the idea of using twin and triplet network-based training setups for fine-tuning transformer language models to map sentences into meaningful semantic vector spaces under the name Sentence Transformers. In their training setup, two or three different sentences are fed into the same transformer language model. These pairs or triplets of sentences are assigned scores such as cosine similarities or concrete training labels. A prediction head which is fed the output of the transformer language model for all two or three sentences is trained to predict the assigned scores or labels. After this training process, the transformer language model can embed sentences into a vector space in which they are well-separated according to the respective training objective.

The idea behind the system presented in this paper is to combine this approach of twin network embedding space learning with the sliding window-based approach of Zehe et al. (2021a). More precisely, my approach is to utilise a twin network-based training setup to learn an embedding space encoding information about a sentence as well as the sentences surrounding it. The goal is that, within this vector space, the embeddings of sentences bordering scenes are well-separated from those of in-scene sentences.
Figure 1: The architecture of the neural network model in prediction mode when generating contextualized sentence embeddings.

Instead of a single BERT model as in Reimers and Gurevych (2019), the system uses two of them, one functioning as sentence encoder and one as context encoder. In both cases, the regular pooling layer output of these networks is used to encode given input sentences. While the sentence encoder is only used to predict an embedding for a given target sentence, the context encoder additionally predicts embeddings for a context window of n sentences to the left and to the right of this target sentence. The outputs of both encoders are concatenated to acquire the final embedding of a sentence and its context:

$$m(s_t) = e_{\mathrm{sent}}(s_t) \oplus e_{\mathrm{cont}}(s_t) \quad (1)$$

$$e_{\mathrm{sent}}(s_t) = B_1(s_t) \quad (2)$$

$$e_{\mathrm{cont}}(s_t) = c_{\mathrm{left}}(s_t) \oplus B_2(s_t) \oplus c_{\mathrm{right}}(s_t) \quad (3)$$

$$c_{\mathrm{left}}(s_t) = B_2(s_{t-n}) \oplus \cdots \oplus B_2(s_{t-1}) \quad (4)$$

$$c_{\mathrm{right}}(s_t) = B_2(s_{t+1}) \oplus \cdots \oplus B_2(s_{t+n}) \quad (5)$$

In these equations, s_t is a given sentence at time step (position in text) t, m is the function used for predicting embeddings, e_sent and e_cont are the two encoder functions, B_1 and B_2 refer to the two underlying BERT networks, c_left(s_t) and c_right(s_t) are the functions used for acquiring the context of a given sentence s_t, ⊕ denotes concatenation, and n determines the size of the context.

For training such a sentence embedding model, I randomly sampled from the training set 15000 pairs of sentences which were both either scene or non-scene borders and 15000 pairs in which the two sentences came from different categories, the majority of them being pairs of a scene border and an in-scene sentence. While the pairs from the former set are assigned a score of 1, the pairs from the latter set are assigned a score of -1.

$$m_{\mathrm{concat}}(p) = m(s_1(p)) \oplus m(s_2(p)) \quad (6)$$

$$f(p) = L(m_{\mathrm{concat}}(p)) \quad (7)$$

In these equations, p refers to a triple of two sentences from the training set and a corresponding score (-1 or 1, depending on class equality), while s_1(p) and s_2(p) are functions retrieving the first respectively second sentence from a given training triple. f(p) is the final output score calculated by the network during training, and L is a linear feed-forward layer. During training, both sentences of a triple and their local context sentences are propagated through the sentence respectively the context encoder. The pooling layer outputs for both sentences are concatenated and propagated into a linear layer whose single output neuron is trained to predict the assigned score using the hinge embedding loss:

$$\ell(x, y) = \begin{cases} x & \text{if } y = 1 \\ \max(0, \delta - x) & \text{if } y = -1 \end{cases} \quad (8)$$

Within this function, x is a predicted score, y a gold standard one, and δ the so-called margin, a hyperparameter which can be used to control the distances between the vectors a given model learns. This loss is used to learn a maximum-margin-like embedding space which separates scene borders from in-scene sentences.

The GermanBERT variant provided by Huggingface Transformers (Wolf et al., 2020) under the id bert-base-german-dbmdz-uncased (https://huggingface.co/bert-base-german-dbmdz-uncased) is used as the base for both the sentence encoder and the context encoder. The reason for choosing this model is that the data it was pre-trained on includes narrative texts, which makes it an appropriate basis for a model dealing with literary data. The model was trained using AdamW (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) with the learning rate set to 0.000001 and the weight decay to 0.0001.
Figure 2: A visualisation of the twin network-based training setup.
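The pairwise training objective visualised in Figure 2, equations (6)-(8), can be sketched numerically as follows; the linear head, its random weights, and the margin value are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
EMB = 64                      # assumed width of one contextualized embedding m(s)
W = rng.normal(size=2 * EMB)  # weights of the single-output linear layer L
b = 0.0                       # its bias

def score(m_s1, m_s2):
    """f(p) = L(m(s1) ⊕ m(s2)), eqs. (6) and (7)."""
    m_concat = np.concatenate([m_s1, m_s2])  # eq. (6)
    return W @ m_concat + b                  # eq. (7)

def hinge_embedding_loss(x, y, delta=1.0):
    """Eq. (8): x for positive pairs (y = 1), max(0, delta - x) for negative (y = -1)."""
    return x if y == 1 else max(0.0, delta - x)
```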
The embedding model was trained for one epoch using a constant learning rate schedule with a warm-up phase in which the learning rate increases over the first 1000 iterations. No batch processing was used during training.

As visible in Figure 3, the model indeed learned to embed sentences into a vector space in which they are separated into two distinct clusters. However, the model does not seem to have generalized well from the training data what exactly constitutes a scene border. While for 'Der kleine Chinesengott', the German dime novel provided as the trial corpus, the majority of scene borders is located in the smaller of the two clusters, there are also borders located in the larger cluster, and, moreover, many in-scene sentences are sorted into the smaller cluster. This phenomenon was visible after multiple training runs with differently sampled sentence pairs, which implies that drawing clear distinctions between scene borders and in-scene sentences is hard for solely BERT-based models.

3.3 Gradient-Boosted Decision Tree Ensembles

As the embedding model seemingly did not learn a precise enough distinction between scene borders and in-scene sentences, using maximum-margin classification with the resulting embeddings as feature vectors was not an option. Instead, I chose gradient-boosted decision tree ensembles (Mason et al., 1999) as the classification algorithm because of their ability to select distinctive features and ignore less distinctive ones.

During training, this algorithm creates an ensemble of weak regression trees trained to predict the logits within a specialized logistic regression setup; combining enough such trees results in a strong learner. This is conducted by means of gradient descent and decision tree learning. Each subsequent tree is trained to correct erroneous predictions of the previous ones. As each tree is limited to using only a small subset of the features provided in the input feature vectors, the trained ensemble can automatically isolate the features which best distinguish scene borders from in-scene sentences within the training set.

For implementing this part of the system, I used Catboost (Prokhorenkova et al., 2018) as the framework. The model is based upon its multi-class classification mode. The tree growth policy is set to lossguide, and class weights are used, calculated with the following formula:

$$w_c = 1 - \frac{num(c)}{\sum_{c' \in C} num(c')} \quad (9)$$

Here, w_c is the weight of a class c, C is the set of all classes, and num(c) is a function returning the number of training examples for a given class. Additionally, I used early stopping to prevent overfitting: I set the number of training iterations to 5000, let the framework choose a learning rate automatically, and then used the checkpoint of the model which performed best on the trial dime novel.

4 Evaluation

4.1 Results

Shared task evaluations were carried out on two different corpora, resulting in two evaluation tracks. The first of these corpora consisted of 5 additional dime novels similar to the ones the systems were trained on and addressed the in-domain transfer capabilities of the participating systems. The corpus used for the second track consisted of two pieces of highbrow German literature. The aim of this track was to evaluate the out-of-domain transfer capabilities of the participating systems.
Figure 3: The embeddings predicted for the sentences from the dime novel ’Der kleine Chinesengott’ used as trial
data in the shared task visualized in 2D using principal component analysis (Pearson, 1901). 0/brown corresponds
to in-scene sentences, 1/green to scene borders and 2/blue to non-scene borders.
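A two-dimensional projection like the one in Figure 3 can be obtained with principal component analysis; the sketch below uses an SVD-based implementation and random stand-in embeddings in place of the model's actual output:

```python
import numpy as np

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)  # centre the data
    # The right singular vectors of the centred data are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T     # coordinates in the top-2 component basis

embeddings = np.random.default_rng(2).normal(size=(100, 64))  # stand-in data
coords = pca_2d(embeddings)  # (100, 2) points, colourable by tag as in Figure 3
```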
My system ranked second out of four in the first track, reaching a micro F1 of 0.16, and first out of five in the second track, reaching a micro F1 of 0.26. These results confirm the difficulty of this task observed by Zehe et al. (2021a).

Track | F1 | γ | Rank
Dime Novels | 0.16 | 0.085 | 2/4
Highbrow Literature | 0.26 | 0.175 | 1/5

Table 1: The shared task evaluation results of my system.

4.2 Qualitative Error Analysis

To further analyze the results of my system, I turned to qualitative error analysis. For this purpose, I collected the false negative and false positive scene border sentences produced by my system for the trial corpus and analyzed a selection of them with regard to common structural patterns. 128 of the sentences marked as scene borders within the trial corpus were false positives. What quickly became visible was that some false positives contained changes of time, character constellation and/or location. As these function as important signals for a scene change, the model seems to have overgeneralized such cases. The following utterances from false positives are examples of a signified change in time:

Langsam verstrich die Zeit. ('Slowly, time passed.')
Natürlich kamen wir zu spät. ('Of course, we arrived too late.')
unendlich langsam verstrich die Zeit [...]. ('time passed infinitely slowly [...].')
Ich wartete also noch eine Weile, dann aber [...] ('So I waited a while longer, but then [...]')
Gerade in dem Moment vernahm ich [...] ('Just at that moment, I heard [...]')

Examples of a change in character constellation are the following:

Bills Alarmruf hatte den Spitzbuben verscheucht. ('Bill's cry of alarm had scared the rascal away.')
Der Verfolger war [...] untergetaucht. ('The pursuer had [...] gone into hiding.')
Da hörte ich Tom plötzlich aufstehen [...]. ('Then I suddenly heard Tom get up [...].')
Tom erhob sich jetzt und entschuldigte sich [...]. ('Tom now rose and excused himself [...].')
Dem herbeieilenden Portier berichtete ich [...]. ('I reported [...] to the porter hurrying over.')
Ich war wieder allein [...]. ('I was alone again [...].')
Bill meldete in diesem Moment den Besuch Dr. Türks. ('At that moment, Bill announced the visit of Dr. Türk.')
Ich fand ihn ohnmächtig auf dem Fußboden liegen. ('I found him lying unconscious on the floor.')
The following utterances are examples of a location change:

Wir verließen unser Häuschen [...]. ('We left our little house [...].')
"Schnell, zu Wertheim," raunte Tom mir zu. ('"Quick, to Wertheim," Tom whispered to me.')
Wir trafen uns erst wieder draußen in der Linienstraße. ('We only met again outside in Linienstraße.')
Wir durchsuchten noch einmal das Arbeitszimmer [...]. ('We searched the study once more [...].')
Endlich erreichten wir den kleinen Antiquitätenladen. ('Finally, we reached the little antique shop.')
Ich fuhr zur Linienstraße. ('I drove to Linienstraße.')
Dann aber schlich ich mich in den dunklen Hausflur. ('But then I crept into the dark hallway.')

Most false positive sentences mention time, characters or location without explicitly signifying a change. This supports the assumption that the model might have overgeneralized these signals:

In der Nähe des schlesischen Bahnhofs. ('Near the Silesian station.')
"Tom, was tust Du, mußte das sein!" ('"Tom, what are you doing, did that have to happen!"')
Bill lag wieder still. ('Bill lay still again.')
Auch Tom lauschte und schien unschlüssig zu sein. ('Tom, too, listened and seemed undecided.')
Isaak Kornblum besaß Telephon. ('Isaak Kornblum owned a telephone.')
Ich tat es. ('I did it.')

On the other hand, many of the false negatives contain similar signals, which puts the assumption that the model overgeneralized upon such signals into question. One needs to consider, of course, that the majority of the dimensions of the respective embeddings encode sentences from the context of a particular target sentence. Given this fact, in combination with the observation that false positives and false negatives share similar patterns, it seems very likely that these local context sentences played a major role in classification. The following utterances are examples of false negatives:

Tom eilte jetzt die Treppe empor [...]. ('Tom now hurried up the stairs [...].')
Mein Weg ging über die Gartenmauer. ('My way led over the garden wall.')
Dann verschwand er lautlos durch die Vordiele. ('Then he disappeared silently through the front hall.')
Wir [...] verließen schnell den Laden. ('We [...] quickly left the shop.')
dann stieg er die Leiter empor. ('then he climbed up the ladder.')
Tom verschwand schnell durch die Verbindungstür [...]. ('Tom quickly disappeared through the connecting door [...].')

5 Conclusion & Outlook

I presented my submission to the shared task on scene segmentation at KONVENS 2021, a system aimed at segmenting German narrative texts into distinct scenes – spans of text in which character constellations, discourse and story time, and locations stay more or less the same. For its implementation, the task was interpreted as a sentence-in-context classification task. To solve it, I first trained a neural model consisting of two GermanBERT networks, a sentence encoder and a context encoder, which in conjunction predict contextualized sentence embeddings. This was conducted in a twin network setup where triples of two sentences and a corresponding score were fed to a linear layer responsible for predicting that score. The goal was to train a model able to embed sentences into a vector space in which sentences functioning as scene borders would be well-separated from in-scene ones, so that the embeddings could then be used as feature vectors in regular classification. While the model indeed learned a vector space in which sentences were more or less sorted into two distinct clusters, these clusters did not seem to capture a general understanding of the concept of scene borders. This is shown by the observation that gold standard scene borders from the trial set were sorted into both clusters when embedded by the model.

For this reason, gradient boosting was chosen as the subsequent classification algorithm for its ability to isolate a subset of features which would still separate the classes well. Early stopping was used during training, meaning that the model was trained for 5000 iterations on the shared task training data and the iteration which achieved the best results on the trial data set was chosen as final. This achieved comparably poor results, with micro F1 scores of 0.16 for track 1 and 0.26 for track 2. Nonetheless, these results were sufficient for ranks 2/4 and 1/5 in the two tracks.

It is an interesting observation that my system performs better on highbrow literature despite the fact that its training data consisted solely of dime novels, as this contradicts the assumption of the task organizers that dime novels would be easier for participating systems to deal with than highbrow literature. A possible explanation could lie in the more formal nature of highbrow literature, which might result in more regularities useful for successful classification. However, without further inspection, this remains speculation.

Further work could include optimizing the architecture and training procedure of the contextualized sentence embedding model presented in this paper, which might lead to improved downstream results. Moreover, as gradient boosting is a feature-based learning algorithm, it could be an option to combine contextualized sentence embeddings with statistical and hand-crafted features for representing sentences in context. In general, the problem is far from solved, as suggested by the poor results. However, the idea of learning contextualized sentence embeddings and optimizing the corresponding training procedure could be a useful option for future work on the topic.
References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(4-5):993–1022.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Marti A. Hearst. 1997. TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics, 23(1):33–64.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.

Jiwei Li and Dan Jurafsky. 2017. Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 198–209, Copenhagen, Denmark. Association for Computational Linguistics.

Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net.

Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean. 1999. Boosting algorithms as gradient descent. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS'99, pages 512–518, Cambridge, MA, USA. MIT Press.

Hemant Misra, François Yvon, Olivier Cappé, and Joemon Jose. 2011. Text segmentation: A topic modeling perspective. Information Processing & Management, 47(4):528–544.

Karl Pearson. 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572.

Karl Pichotta and Raymond J. Mooney. 2016. Learning statistical scripts with LSTM recurrent neural networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI'16, pages 2800–2806. AAAI Press.

Liudmila Ostroumova Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. CatBoost: unbiased boosting with categorical features. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 6639–6649.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Martin Riedl and Chris Biemann. 2012. TopicTiling: A text segmentation algorithm based on LDA. In Proceedings of the ACL 2012 Student Research Workshop, pages 37–42, Jeju Island, Korea. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Albin Zehe, Leonard Konle, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, Annekea Schreiber, and Nathalie Wiedmer. 2021a. Detecting scenes in fiction: A new segmentation task. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 3167–3177, Online. Association for Computational Linguistics.

Albin Zehe, Leonard Konle, Svenja Guhr, Lea Katharina Dümpelmann, Evelyn Gius, Andreas Hotho, Fotis Jannidis, Lucas Kaufmann, Markus Krug, Frank Puppe, Nils Reiter, and Annekea Schreiber. 2021b. Shared task on scene segmentation@KONVENS2021. In Shared Task on Scene Segmentation.