contains the name of the character that is currently speaking; and finally the elements that contain the text that is spoken.
We use single sentences as input for the BERT model. This is a compromise, as some tags will only cover single words or incomplete sentences. Using whole sentences as input has advantages, though: it gives more context to the model, and it is straightforward to later assign XML tags to sentences or groups of sentences instead of to arbitrary sub-sequences.
2 https://tei-c.org/
Some tags will also have multiple lines of text stored inside them. We therefore use the NLTK sentence tokenizer to split groups of multiple sentences into single sentences, which can then be given as input to our model. The extracted plain text is passed to the tokenizer. The text from inside the speaker tags is also treated as a sentence, even though it usually consists only of a name.
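A minimal sketch of this splitting step, assuming NLTK's German Punkt models and a hypothetical helper function split_into_sentences (both our own choices, not taken from the paper's code):

    import nltk

    # The Punkt models are required once for sentence splitting.
    nltk.download("punkt", quiet=True)

    def split_into_sentences(text: str) -> list[str]:
        """Split the plain text extracted from a TEI element into single sentences."""
        return nltk.sent_tokenize(text, language="german")

    # Example: a passage containing two sentences becomes two model inputs.
    print(split_into_sentences("Er tritt auf. Sie bleibt stehen."))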
Figure 2 shows a part of our final pre-processed dataset, which we use for training and testing. The first column, SENTENCE, contains the tokenized sentences; the second column, Decider, contains the corresponding numeric class labels.
Figure 2: Dataset after preprocessing. 0 stands for stage direction, 1 for speaker tag and 2 for an utterance.
Overall, the dataset contains 1 410 783 sentences with 10 021 598 tokens and 240 794 types. Since the sentences in the dataset are of varying length, we use padding to make all sequences the same length. As the vast majority of sentences have ten or fewer tokens, we set the maximum sequence length to 10.
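One way to realize this padding is via the HuggingFace tokenizer; the following is a minimal sketch under the assumption that the tokenizer's built-in padding is used (the example sentences are invented):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    sentences = ["Erster Akt", "GRAF.", "Was sagt ihr dazu?"]  # toy examples
    # Every sentence is padded (or truncated) to a fixed length of 10 tokens.
    encoded = tokenizer(
        sentences,
        padding="max_length",
        truncation=True,
        max_length=10,
        return_tensors="pt",
    )
    print(encoded["input_ids"].shape)  # torch.Size([3, 10])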
5. Experiments
5.1. Baseline
We implement a simple baseline to compare the results of the transformer models against. For the baseline, we choose a conditional random field (CRF) model, which is able to take sequential information into account. To make the baseline comparable to the BERT models, we also use sentences as input and let the model predict which of five classes a sentence belongs to: act (0), scene (1), stage direction (2), speaker tag (3) or utterance (4). The CRF receives features extracted from each sentence, namely:
• The lower-cased surface string of the sentence.
• If the sentence contains the German word ‘Akt’.
• If the sentence contains the German words ‘Szene’ or ‘Scene’.
• If the sentence begins with an uppercase letter.
220
• If the sentence only contains uppercase letters.
• If the sentence contains a digit.
For training, we make use of the limited-memory BFGS algorithm and elastic net regularization.
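The paper does not name a concrete CRF implementation; the sketch below assumes sklearn-crfsuite, where the 'lbfgs' algorithm combined with non-zero c1 and c2 coefficients corresponds to L-BFGS training with elastic net regularization (feature names, coefficient values and the toy data are our own):

    import sklearn_crfsuite

    def sentence_features(sentence: str) -> dict:
        """The features listed above, extracted from a single sentence."""
        lower = sentence.lower()
        return {
            "surface": lower,
            "has_akt": "akt" in lower,
            "has_szene": "szene" in lower or "scene" in lower,
            "starts_upper": sentence[:1].isupper(),
            "all_upper": sentence.isupper(),
            "has_digit": any(ch.isdigit() for ch in sentence),
        }

    # Toy data: each play is one sequence of sentences with one label per sentence.
    train_plays = [["Erster Akt", "GRAF.", "Was sagt ihr dazu?"]]
    train_labels = [["0", "3", "4"]]

    X_train = [[sentence_features(s) for s in play] for play in train_plays]

    crf = sklearn_crfsuite.CRF(
        algorithm="lbfgs",      # limited-memory BFGS
        c1=0.1, c2=0.1,         # L1 + L2 penalties, i.e. elastic net regularization
        max_iterations=100,
    )
    crf.fit(X_train, train_labels)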
5.2. Experimental Setup
Training For the BERT models, we fine-tune pre-trained models provided by HuggingFace (https://huggingface.co/). We train with the AdamW algorithm [5], an improved version of Adam, using a batch size of 256, and clip the gradient norm at 1 as an extra safety measure against exploding gradients. The model is implemented in PyTorch [7] and scikit-learn [8]. We use negative log likelihood as the loss function and a learning rate of 2e−5. The training runs for 20 epochs.
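A condensed sketch of one training step under these settings; the model identifier is the one used in the experiments below, while the surrounding data handling (batch layout, number of labels) is our assumption:

    import torch
    from torch.nn.utils import clip_grad_norm_
    from transformers import AutoModelForSequenceClassification

    # num_labels is 2 for the binary tasks and 5 for the combined task below.
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=5
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    loss_fn = torch.nn.NLLLoss()  # negative log likelihood

    def training_step(batch: dict) -> float:
        """One update on a batch of up to 256 padded sentences."""
        optimizer.zero_grad()
        logits = model(input_ids=batch["input_ids"],
                       attention_mask=batch["attention_mask"]).logits
        loss = loss_fn(torch.log_softmax(logits, dim=-1), batch["labels"])
        loss.backward()
        clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip gradient norm at 1
        optimizer.step()
        return loss.item()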
Table 1
Distribution of classes in the dataset.

Class      Count     Rel. Count
Act          1,458        0.001
Scene       11,001        0.008
Stage      175,238        0.124
Speaker    316,451        0.224
Speech     906,635        0.642
Class Weights Table 1 shows how the classes are distributed among the sentences. Some classes have many more training examples than others, which introduces a bias into our models. To deal with this problem, we apply class weights to the loss function. These are computed as the inverse frequency of the classes in the training set.
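As an illustration, such weights could be computed as follows, here using the counts from Table 1 as a stand-in for the training-set frequencies (the exact normalization is our assumption):

    import torch

    # Class counts (act, scene, stage, speaker, speech), here taken from Table 1.
    counts = torch.tensor([1458., 11001., 175238., 316451., 906635.])

    # Inverse-frequency weighting: rare classes receive proportionally larger weights.
    weights = counts.sum() / counts
    weights = weights / weights.sum()

    # The weights are passed to the loss function used for fine-tuning.
    loss_fn = torch.nn.NLLLoss(weight=weights)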
5.3. Evaluation
We consider accuracy, precision, recall and F1-score as metrics.
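All of these can be computed with scikit-learn [8]; a minimal example with invented toy labels:

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_true = [0, 1, 1, 0, 1]  # toy gold-standard labels
    y_pred = [0, 1, 0, 0, 1]  # toy system output

    print(accuracy_score(y_true, y_pred))
    print(classification_report(y_true, y_pred, digits=2))  # precision, recall, F1
    print(confusion_matrix(y_true, y_pred))  # rows: gold standard, columns: prediction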
5.4. Results
In this section, we investigate the performance of our proposed model on various tasks. We split the dataset randomly into three sets: train, validation, and test, where the train set comprises 70% of the overall data and the validation and test sets 15% each. We fine-tune the model using the train and validation sets and evaluate on the test set.
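The 70/15/15 split can be obtained, for instance, with two calls to scikit-learn's train_test_split; a sketch with placeholder data and an arbitrary random seed of our choosing:

    from sklearn.model_selection import train_test_split

    examples = list(range(1000))  # placeholder for the (sentence, label) pairs

    # First split off 70 % for training, then halve the remainder into
    # validation and test (15 % of the full data each).
    train, rest = train_test_split(examples, train_size=0.70, random_state=42)
    val, test = train_test_split(rest, test_size=0.50, random_state=42)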
Detecting Act Boundaries We extract the data from the corresponding TEI elements. The sentence splitter recognizes 1 208 899 sentences. The goal of the prediction is to label every sentence: if a sentence is the first sentence of an act, it is classified as 1, otherwise as 0. Hence it is a binary classification task. For this task we use ’bert-base-uncased’ (the model identifier at HuggingFace) as the base model. Table 2 shows the results as a classification report and a confusion matrix.
We can see that the model predicts a non-boundary correctly in nearly 100 % of the cases. Yet, the model is not simply overfitting on class 0, as the prediction of act boundaries still reaches a high F1-score of 0.89. From Table 2b we can see that the model more often misses an act boundary, classifying it as a non-boundary, than the other way around (i.e., for the act-boundary class, the number of false negatives is higher than the number of false positives).
Table 2
Evaluation results for detecting act boundaries.

(a) Classification report:

               precision  recall  f1-score  support
            0       1.00    1.00      1.00  181 046
            1       0.93    0.85      0.89      289
     accuracy                         1.00  181 335
    macro avg       0.96    0.93      0.94  181 335
 weighted avg       1.00    1.00      1.00  181 335

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO        0     1
       0   181027    19
       1       42   247
Detecting Stage Directions For this task we use the stage direction and speech tags, which together contain 1 203 911 sentences. Each sentence is classified according to whether or not it is (part of) a stage direction, again using the ’bert-base-uncased’ model. As stage directions are much more frequent than act boundaries and much more similar to character speech, we consider this task to be more difficult than the one discussed above: instead of relying on lexical cues, the model needs to take discourse structure and semantic information into account.
Table 3 shows the results; here, 0 represents character speech and 1 stage direction. The model indeed has more difficulty with correctly predicting stage directions. With a precision of 0.70 and a recall of 0.95, the detection nevertheless performs reasonably well. The confusion matrix shows that false positives are much more common than false negatives, which can probably be explained by the imbalance in the training dataset.
Table 3
Evaluation results for detecting stage directions and character speech.

(a) Classification report:

               precision  recall  f1-score  support
            0       0.99    0.92      0.95   151409
            1       0.70    0.95      0.81    29178
     accuracy                         0.93   180587
    macro avg       0.84    0.94      0.88   180587
 weighted avg       0.94    0.93      0.93   180587

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO        0      1
       0   139324  12085
       1     1365  27813
Table 4
Evaluation results when using the model BERT_Uncased.

(a) Classification report:

               precision  recall  f1-score  support
            0       0.97    0.97      0.97      252
            1       0.98    0.99      0.98     1820
            2       0.67    0.92      0.77    29591
            3       0.97    0.99      0.98    52912
            4       0.97    0.91      0.95   153789
     accuracy                         0.93   238364
    macro avg       0.92    0.96      0.93   238364
 weighted avg       0.95    0.93      0.94   238364

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO      0     1      2      3       4
       0    244     1      2      3       2
       1      7  1795     11      1       6
       2      0     3  27281   1195    1112
       3      0     0    339  52552      21
       4      0    29  13324    366  140070
Table 5
Evaluation results when using the model BERT_German_Uncased.

(a) Classification report:

               precision  recall  f1-score  support
            0       1.00    0.95      0.97      252
            1       0.97    0.99      0.98     1820
            2       0.77    0.93      0.84    29591
            3       0.97    0.99      0.98    52912
            4       0.99    0.95      0.97   153789
     accuracy                         0.95   238364
    macro avg       0.94    0.96      0.95   238364
 weighted avg       0.96    0.95      0.96   238364

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO      0     1      2      3       4
       0    240     7      3      0       2
       1      1  1808      6      2       3
       2      0    14  27526   1149     902
       3      0     3    411  52425       7
       4      0    32   7885    237  145635
All tasks combined For this task, we extract the data from all of the above-mentioned tags, which in total contain 1 589 090 sentences. The task is now a 5-way classification: we classify sentences as being an act boundary (0), a scene boundary (1), (part of) a stage direction (2), the name of a speaker (3), or character speech (4).
For this task we use different BERT models and compare them. Table 4 shows the results for ’bert-base-uncased’. All results are comparable to those obtained when classifying the tags individually; some are lower, but not by much. This is promising, as it shows that we can potentially predict the complete structure of a plain-text drama at once without losing much predictive power compared to classifying the single types of structure individually.
As mentioned earlier, all models so far have been pre-trained on English data. The above evaluation shows that they can make good predictions even on German data, which can be explained by the fact that most of the distinguishing features needed so far are structural rather than content-based. However, for the task of predicting all tags together, we now use a model trained on German data and see if the results can be further improved. Table 5 shows the results for the ’bert-base-german-uncased’ model. We can see that especially for predicting stage directions, the performance improves substantially, by 7 percentage points F1-score. The other results are either identical or slightly higher, in the case of character speech by 2 percentage points.
Table 6
Evaluation results when using the model BERT_German_Cased.

(a) Classification report:

               precision  recall  f1-score  support
            0       0.96    0.97      0.97      252
            1       0.95    0.99      0.97     1820
            2       0.72    0.96      0.82    29591
            3       0.96    1.00      0.98    52912
            4       1.00    0.92      0.96   153789
     accuracy                         0.94   238364
    macro avg       0.92    0.97      0.94   238364
 weighted avg       0.95    0.94      0.94   238364

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO      0     1      2      3       4
       0    245     1      5      0       1
       1      5  1800      5      7       3
       2      5    10  28406    736     434
       3      0     1    117  52780      14
       4      0    73  10914   1575  141227
Table 7
Evaluation results of the CRF baseline.

(a) Classification report:

               precision  recall  f1-score  support
            0       1.00    0.88      0.93      252
            1       0.99    0.91      0.95     1820
            2       0.86    0.30      0.44    29591
            3       0.99    0.92      0.95    52912
            4       0.86    0.99      0.92   153789
     accuracy                         0.89   238364
    macro avg       0.94    0.80      0.84   238364
 weighted avg       0.90    0.89      0.88   238364

(b) Confusion matrix; rows show the gold standard (GS), columns the system output (SO):

 GS \ SO      0     1     2      3       4
       0    172     0     0      3      20
       1      0  1375     1     15     118
       2      0     9  7108     30   16479
       3      0     0    23  46162    3789
       4      0     3  1047     41  135223
This is to be expected, as these two types are more content-based. Still, the English model is able to pick up on enough structural cues to also predict well on German data.
Lastly, we check whether it makes a difference to use a model that was trained on cased data, as all models so far were trained on uncased data. Here, the ’bert-base-german-cased’ model is used. The results, shown in Table 6, are slightly lower than in the uncased setting. This suggests that preserving case lets the model generalize less well.
Baseline We compare these final results to the baseline system. The results are shown in Table 7. The baseline performs rather well for the tasks of predicting act and scene boundaries and recognizing speaker tags; however, the BERT-based models achieve slightly higher values for all of these classes. For the task of character speech identification, the baseline performs worse than the BERT-based models in terms of precision, but achieves a higher recall than all other models. For the crucial task of recognizing stage directions, it yields a rather low recall, but the highest precision.
5.5. Summary of the results
In all experiments, we observe that the models achieve precision and recall scores of around 95 % to 99 % for most of the categories. For stage directions, the evaluation yields lower scores: the models misclassify some sentences of character speech as stage directions. By experimenting with different BERT models, we are able to reach a precision of 77 % for stage directions, which makes BERT German Uncased the most suitable model for these predictions. While the CRF-based model sets a high baseline for the tasks of act, scene and speaker recognition, the BERT-based models outperform the baseline in all measures for these classes. Only for the content-based tasks of speech and stage direction recognition does the baseline achieve higher results in recall and precision, respectively. In future work, the transformer models might benefit from being combined with the CRF model.
6. Conclusion and Future work
In this paper, we have shown that BERT is a reasonable model for predicting and extracting structural segments from dramatic texts. Based on this finding, we have proposed a novel fine-tuned model based on BERT. From the above results we can conclude that ’BERT_German_Uncased’ is the most effective base model. We can also conclude that all models perform quite well, whether we predict the segments with binary classification or in the full model with five classes. We were further able to show that models trained on English data can predict the more structural elements of German dramatic texts with high accuracy. However, for the structural elements that rely more on text content, a model trained on German data performs better.
Both recall and precision are quite high for all classes except class 2 (stage directions), which means that the model predicts these classes accurately. The recall for class 2 is 0.93, which means that the model was able to find 93 % of the stage direction sentences. However, precision is lower for class 2, which means that the model misclassifies some class 4 sentences (character speech) as stage directions.
In the future, we plan to extend the presented work into a fully automatic mapping tool that converts plain-text scans of dramatic texts into properly structured TEI/XML documents. Even if this automatic conversion is likely to contain some errors, correcting it manually is much less labor-intensive than encoding the entire play by hand. Once the above-mentioned tool is developed and functioning, we plan to add texts that are currently only available as plain text to the DraCor corpora. One challenge we will most likely face is that OCRed texts usually contain mistakes, which might throw off the transformer model. Hence, we will also experiment with text normalization techniques. This opens a path towards large-scale data analysis of plays that are currently not available as part of the DraCor repository. In addition, our analysis has shown that the trained model works reasonably well even when used across language boundaries. This suggests that it is also possible to apply a very similar model to plays in other languages, as training data for many languages is already available.
Acknowledgements
The first and third author have conducted the described research within the QuaDramA project, funded by the Volkswagen Foundation, and within the Q:TRACK project, funded by the German Research Foundation (DFG) in the context of SPP 2207 Computational Literary Studies. We thank both for making this possible.
References
[1] A. Doucet, G. Kazai, S. Colutto, and G. Mühlberger. “Overview of the ICDAR 2013 Competition on Book Structure Extraction”. In: Proceedings of the Twelfth International Conference on Document Analysis and Recognition (ICDAR). Washington, D.C., US, 2013, pp. 1438–1443.
[2] F. Fischer, I. Börner, M. Göbel, A. Hechtl, C. Kittel, C. Milling, and P. Trilcke. “Programmable Corpora: Introducing DraCor, an Infrastructure for the Research on European Drama”. In: Proceedings of DH2019: “Complexities”. Utrecht, The Netherlands, 2019. doi: 10.5281/zenodo.4284002.
[3] M. A. Hearst. “TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages”. In: Computational Linguistics 23.1 (1997), pp. 33–64. url: https://www.aclweb.org/anthology/J97-1003.
[4] A. K. John, L. Di Caro, and G. Boella. “Text Segmentation with Topic Modeling and Entity Coherence”. In: Proceedings of the 16th International Conference on Hybrid Intelligent Systems (HIS). Ed. by A. Abraham, A. Haqiq, A. M. Alimi, G. Mezzour, N. Rokbani, and A. K. Muda. Vol. 552. Advances in Intelligent Systems and Computing (AISC). Springer, 2017, pp. 175–185. doi: 10.1007/978-3-319-52941-7_18. url: https://link.springer.com/chapter/10.1007/978-3-319-52941-7_18.
[5] I. Loshchilov and F. Hutter. “Decoupled Weight Decay Regularization”. In: International Conference on Learning Representations. 2019. url: https://openreview.net/forum?id=Bkg6RiCqY7.
[6] L. McConnaughey, J. Dai, and D. Bamman. “The Labeled Segmentation of Printed Books”. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP). Copenhagen, Denmark, 2017, pp. 737–747. doi: 10.18653/v1/D17-1077. url: https://aclanthology.org/D17-1077.
[7] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. “PyTorch: An Imperative Style, High-Performance Deep Learning Library”. In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett. Curran Associates, Inc., 2019, pp. 8024–8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[8] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.
[9] C. Pethe, A. Kim, and S. Skiena. “Chapter Captor: Text Segmentation in Novels”. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020, pp. 8373–8383. doi: 10.18653/v1/2020.emnlp-main.672. url: https://aclanthology.org/2020.emnlp-main.672.
[10] M. Pfister. The Theory and Analysis of Drama. Trans. by J. Halliday. European Studies in English Literature. Cambridge: Cambridge University Press, 1988. doi: 10.1017/cbo9780511553998.
[11] A. Zehe, L. Konle, L. Dümpelmann, E. Gius, A. Hotho, F. Jannidis, L. Kaufmann, M. Krug, F. Puppe, N. Reiter, A. Schreiber, and N. Wiedmer. “Detecting Scenes in Fiction: A new Segmentation Task”. In: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Association for Computational Linguistics, 2021, pp. 3167–3177. url: https://www.aclweb.org/anthology/2021.eacl-main.276.