Recognising non-named spatial entities in literary texts: a novel spatial entities classifier⋆

Daniel Kababgi¹, Giulia Grisot², Federico Pennino³ and Berenike Herrmann¹
¹ Universität Bielefeld
² University of Cambridge
³ Università di Bologna

Abstract

Predicting spatial representations in literature is a challenging task that requires advanced machine learning methods and manual annotations. In this paper, we present a study that leverages manual annotations and a BERT language model to automatically detect and recognise non-named spatial entities in a historical corpus of Swiss novels. The annotated data, consisting of Swiss narrative texts in German from the period 1840 to 1950, was used to train the machine learning model and fine-tune a deep learning model specifically for literary German. The annotation process, facilitated by the use of Prodigy, enabled iterative improvement of the model's predictions by selecting informative instances from the unlabelled data. Our evaluation metrics (F1 score) demonstrate the model's ability to predict various categories of spatial entities in our corpus. This new method enables researchers to explore spatial representations in literary texts, contributing both to digital humanities and literary studies. While our study shows promising results, we acknowledge challenges such as the representativeness of the annotated data, biases in manual annotations, and domain-specific language. By addressing these limitations and discussing the implications of our findings, we provide a foundation for future research in sentiment and spatial analysis in literature. Our findings not only contribute to the understanding of literary narratives but also demonstrate the potential of automated spatial analysis in historical and literary research.

Keywords
Computational Literary Studies, language model, spatial humanities, token classification

CHR 2024: Computational Humanities Research Conference, December 4–6, 2024, Aarhus, Denmark
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
⋆ All data and code can be found on GitHub: https://github.com/XXXX
daniel.kababgi@uni-bielefeld.de (D. Kababgi); gg524@cam.ac.uk (G. Grisot); federico.pennino2@unibo.it (F. Pennino); berenike.herrmann@uni-bielefeld.de (B. Herrmann)
ORCID: 0009-0002-0990-6418 (D. Kababgi); 0000-0002-3038-6202 (G. Grisot); 0000-0001-7563-070X (F. Pennino); 0000-0002-5256-0566 (B. Herrmann)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Building on previous work examining fictional space and sentiment in Swiss-German narrative [7], this paper reports on the development and evaluation of a novel machine learning model for the analysis of fictional space in literary texts.^1

In recent criticism across disciplines there has been an increased emphasis on considering place and space as crucial factors in understanding social, cultural, and historical phenomena. This perspective on spatiality is generally referred to as the 'spatial turn' [9, 10, 19], and, in literary studies, it highlights how integral the representation of space is to the way we understand and contextualise narrative and fictional texts. The exploration of spatial representations in literary works offers valuable insights into the landscapes, the constructed environments and their social implications within narratives, as well as into the cultural and socio-political constructs surrounding certain images. While there are several valuable proposals for a quantitative approach to spatial research [24, 2], including for example a differentiation of space as background and place more specifically as locus of events [23, 15], we set a different focus here.

^1 The dataset and the code can be found here: https://github.com/DanielKababgi/spatial_entities_classifier
In this paper, we present a case study on the prediction of what we call 'non-named spatial entities' (NNSE) in a historical corpus of Swiss-German novels, using a deep learning model in conjunction with BERT and Prodigy. By combining manual annotations and advanced machine learning methods, we aim to automatically detect and recognise NNSEs within the literary narratives via an approach similar to named entity recognition (NER).

NER techniques are used to identify and categorise text segments that refer to entities such as people, places, or companies, and that 'constitute proper names' [11]. The latest NER techniques rely on manually annotated text corpora, which are automatically analysed to build models that capture language use and grammar. These models can then identify and classify entities in new, unprocessed documents. State-of-the-art NER systems come with pre-built models trained on extensive collections of annotated documents, such as news articles. These models typically perform well and are ideal for specific applications such as analysing customer feedback or extracting locations and characters. However, when applied to documents with linguistic features not well represented in the training data, such as literary texts, NER performance can decline and the likelihood of errors increases; error rates are also higher for most languages other than English.

Within literary studies, various scholars have used NER, particularly to identify fictional characters [20], build social networks [5], identify geographical locations [3], assign subject headings to novels [16], or analyse relationships between literary works [21, 22, 17]. Until recently, however, only a few researchers have focused on the identification of NNSEs, i.e. those terms, or elements of space representation, which are not necessarily named geographical locations, like Berlin, London, or Zurich, and which therefore typically cannot be located on a map.
It is this kind of entity that generally contributes most effectively to creating the so-called 'storyworld' [18]: simple terms or phrases such as 'mountain', 'bridge', 'beach' or 'cave', as well as the objects and architectural parts that make them tangible, such as 'window', 'table', or 'wall'. A similar perspective has been considered by Schumacher, Flüh, and Nantke [14], who used conditional random fields to automatically annotate, among other things, non-named places, a concept similar to NNSEs. However, the operationalisation of space in their research focuses on places that can be found on a map, leaving the broader concept of NNSEs unexplored. The popular BookNLP toolkit by Bamman is also able to identify what they call 'locations' (for natural entities) and 'facilities' (for man-made structures) in English-language texts, with an accuracy of up to 90%.^2 While BookNLP is thus able to differentiate NNSEs somewhat along the lines of our own needs, we propose a subdivision of 'facilities' into the more discrete classes shown in Section 2.1. This paper sets out to fill this gap by training a model that will help us identify spatial elements in narrative.

^2 As is shown on their official GitHub page: https://github.com/booknlp/booknlp?tab=readme-ov-file

Table 1
Examples of words for each category of NNSE

Category  Examples
interior  Abstellkammer (storage room), Wohnzimmer (living room), Küche (kitchen)
urban     Bibliothek (library), Kloster (abbey), Vorstadt (suburb)
rural     Bauernhaus (farmhouse), Garten (garden), Schweinestall (pigsty)
natural   Berg (mountain), Fluss (river), Wald (forest)

2. Method

2.1. Spatial categories

In order to train our model to recognise literary space, we decided to train it not only to recognise non-named spatial entities (NNSEs), but also to distinguish among four different types of spatial environments.
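The four categories plus a residual class 'O' (no spatial entity) define the label set the token classifier predicts over. The sketch below reconstructs this scheme with a toy gold annotation built from Table 1 terms; the exact label encoding, the example sentence, and the `label2id`/`id2label` names are our own illustrative choices, not taken from the authors' code.

```python
# Minimal sketch of the token-level label scheme for NNSE classification.
# Four NNSE categories (cf. Table 1) plus "O" for tokens that are not
# spatial entities. Illustrative reconstruction, not the authors' code.

NNSE_LABELS = ["interior", "urban", "rural", "natural"]
ALL_LABELS = NNSE_LABELS + ["O"]

# Integer ids for a classifier head, and the reverse mapping for decoding.
label2id = {label: i for i, label in enumerate(ALL_LABELS)}
id2label = {i: label for label, i in label2id.items()}

# A toy gold annotation: one label per token (hypothetical sentence,
# "Der Fluss hinter dem Bauernhaus glitzerte." built from Table 1 terms).
tokens = ["Der", "Fluss", "hinter", "dem", "Bauernhaus", "glitzerte", "."]
gold = ["O", "natural", "O", "O", "rural", "O", "O"]

assert len(tokens) == len(gold)
assert all(label in ALL_LABELS for label in gold)
```

Keeping 'O' as an explicit class (rather than a missing label) matches the evaluation reported later, where class O appears alongside the four NNSE types.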
We decided to base our categorisation on the research by Grisot and Herrmann [7], who looked at the sentiment encoded in narrative texts in relation to both named and non-named entities. They used a dictionary-based approach, collecting spatial terms for geographical locations as well as for non-named entities, distinguishing in particular the categories 'rural', 'natural' and 'urban'. While these three categories offered a promising base, we felt that, for a more comprehensive perspective on the spaces and places rendered in fictional texts, we also needed to include spatial elements describing the interiors/indoor spaces of buildings and rooms. We therefore created an additional category of NNSE, 'interior'. Some examples for each category are shown in Table 1 above.

2.2. Annotations and model training

To produce the training set, two annotators were provided with written guidelines and trained in person to understand the differences between the various NNSE categories. They were then instructed to use the platform Prodigy [12], which allowed them to read sentences from the training set in random order and to annotate NNSEs directly on the interface by adding labels to individual tokens. For the annotation process, six novels were sampled from the complete corpus of Swiss-German novels [8].^3 The novels were split into sentences (N=9,062), which were then manually annotated. The annotators showed high inter-annotator agreement (Cohen's Kappa) for the NNSE types 'interior', 'natural', and 'rural', and medium agreement for 'urban',^4 as well as high agreement for the distinction between NNSE and non-NNSE tokens (Cohen's Kappa = 0.898). The agreement values for the individual types are shown in Table 2, together with the number of sentences in our dataset in which the various classes of spatial entities were identified by manual annotation.
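The agreement values just mentioned are Cohen's Kappa scores, which correct observed agreement for the agreement expected by chance. A minimal stdlib sketch of the computation over two annotators' parallel label sequences (illustrative, not the authors' evaluation script; the toy labels are invented):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' parallel label sequences:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the label marginals."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of positions with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: token-level decisions by two annotators.
a = ["O", "natural", "O", "rural", "O", "O", "natural", "O"]
b = ["O", "natural", "O", "O",     "O", "O", "natural", "O"]
print(round(cohens_kappa(a, b), 3))  # one disagreement out of eight tokens
```

Perfect agreement yields kappa = 1.0; the single disagreement in the toy example already pulls the score well below the raw 7/8 agreement rate, which is the point of the chance correction.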
Type O shows the number of sentences in which no NNSEs were identified in the annotation process. These amount to over 86% of the dataset, making the distribution of the five categories in our dataset unbalanced. However, the NNSE categories themselves are much closer together, ranging between 2.2% ('urban') and 4.5% ('interior') of all sentences.

^3 These are: Der Wetterwart (1905) by Jacob Christoph Heer, Heimatscholle (1914) by Maria Goswina von Berlepsch, Berge und Menschen (1911) by Heinrich Federer, Heidis Lehr- und Wanderjahre (1880) by Johanna Spyri, Friedli, der Kolderi (1891) by Carl Spitteler, and Martin Salander (1886) by Gottfried Keller.
^4 The low score for 'urban' is explainable by the low number of occurrences of NNSEs of this type in the dataset.

Table 2
Types of NNSE in the complete dataset, with inter-annotator agreement (Cohen's Kappa) between the two annotators for each class and the number of sentences in the complete dataset in which each type of NNSE was identified by manual annotation. Class O marks sentences with no NNSE.

Class     Cohen's Kappa   n occurrences   Percentage
interior  0.933           412             4.5
urban     0.608           196             2.2
rural     0.775           315             3.5
natural   0.857           328             3.6
O         0.896           7811            86.2

After the annotation process, sentences were randomly assigned to either the training dataset (80%, N=7,249) or the test dataset (20%, N=1,813), following common best practices for training a machine learning model [6]. To train the classifier, PyTorch version 2.1.1 was used as the deep learning framework [13]. In conjunction with PyTorch, the Hugging Face Transformers library (version 4.35.2) was used to load and interact with the language model [25]. The model gbert-large by deepset was used as the input layer for the token classifier, since it outperformed all other language models [4].
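The dataset preparation just described — a random 80/20 split, and the balanced downsampling of the training data used later in the method — can be sketched with the standard library as follows. This is an illustrative reconstruction, not the authors' code; the function name, the seed, and the toy data are assumptions.

```python
import random

def split_and_downsample(sentences, has_nnse, seed=42):
    """Sketch of the dataset preparation: random 80/20 train/test split,
    then a balanced, downsampled training set containing every sentence
    with at least one NNSE plus an equal-sized random sample of
    sentences without NNSEs."""
    rng = random.Random(seed)
    idx = list(range(len(sentences)))
    rng.shuffle(idx)
    cut = int(len(idx) * 0.8)
    train_idx, test_idx = idx[:cut], idx[cut:]

    with_nnse = [i for i in train_idx if has_nnse[i]]
    without = [i for i in train_idx if not has_nnse[i]]
    balanced = with_nnse + rng.sample(without, len(with_nnse))
    rng.shuffle(balanced)
    return train_idx, test_idx, balanced

# Toy data: 100 "sentences", 10 of which contain an NNSE.
sents = [f"s{i}" for i in range(100)]
flags = [i < 10 for i in range(100)]
train, test, balanced = split_and_downsample(sents, flags)
```

With the paper's real counts, the same recipe yields the reported figures: 1,002 NNSE sentences plus 1,002 sampled O-only sentences give the downsampled set of 2,004.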
The model classifies each token of a given text, attempting to predict whether the token under consideration is a NNSE (one of the four types mentioned above) or not a spatial term (O). We tested whether the model performs better with the complete, unbalanced training dataset (N=7,249 sentences) or with a more balanced, downsampled training dataset (N=2,004 sentences). The downsampled training dataset was composed of sentences including at least one NNSE (N=1,002) and a random selection of sentences with no NNSE of equal size. The downsampled training dataset was then split again into a final training dataset (N=1,603) and a validation set (N=401). The training was run for 17 epochs with a learning rate of 5e-6 and a dropout of 0.1; these parameters were determined through extensive hyperparameter testing. Since the maximum length of all sentences is 40 words, or 61 tokens, the maximum sequence length for the BERT model was set to 64 tokens.

3. Results

With the annotation and training process described above, we produced a classifier able to 1) identify NNSEs in a given sentence, and 2) classify each identified NNSE as belonging to one of the four discrete classes: 'rural', 'urban', 'natural' or 'interior'.^5 To find the best parameters, the F1 score for each class was calculated on the validation set after each epoch. While the F1 score for the O class was very high in every epoch, the scores for the discrete categories of NNSE fluctuated quite strongly.

^5 The results reported here should be understood as a report on an ongoing process, rather than as a final product. For example, a theoretically more sound model may understand 'interior' not as a category of its own, but allow for categorisation into interior-rural or interior-urban. We are planning to run a set of annotations on the interior items to add this type of information and explore the differences.

Figure 1: F1 scores on the validation dataset for all classes, including O, over 160 epochs. The highest scores across all trained epochs are for class O; the lowest scores are for identifying 'urban' NNSEs.

Figure 1 illustrates the performance of the classifier on the validation dataset across epochs. Class O, which represents tokens not classified as NNSE according to our guidelines, consistently achieves an almost perfect F1 score of 1. However, after 17 epochs there is a significant decline in performance. Among the other classes, the 'interior' class performs best with an F1 score of 0.792, while the 'urban' class performs worst with an F1 score of 0.632 on the validation dataset. For a more detailed analysis, Figure 2 presents the average F1 scores for the 'interior', 'urban', 'rural', and 'natural' classes on the validation dataset, excluding class O. The black dotted line indicates the highest overall F1 score of 0.743.

Figure 2: Mean F1 score for the classes 'interior', 'urban', 'rural', and 'natural' over 160 epochs. The dotted black line indicates the highest mean F1 score of 0.743.

Table 3 below presents the final scores on the test dataset. Class O, with an F1 score of 0.99, significantly outperforms all other classes, as expected. Among the NNSE classes, 'natural' achieves the highest F1 score (0.65), followed by 'interior' (0.61), while 'rural' (0.56) and 'urban' (0.53) perform worst.

Table 3
F1 scores per class on the test dataset. After class O, the second-best F1 score is for the class 'natural'; the worst performance is for class 'urban'.

Class     F1 score   Precision   Recall
interior  0.6079     0.4673      0.8696
urban     0.5333     0.4034      0.7869
rural     0.5573     0.4620      0.7025
natural   0.6468     0.5353      0.8125
O         0.9984     0.9996      0.9972

In addition to the general F1 scores for each class, we analysed false classifications across all five classes. The separation between NNSEs and class O was highly effective, with tokens belonging to class O rarely being incorrectly classified as NNSE.
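The per-class scores in Table 3 are related by the standard definition F1 = 2PR/(P+R); the sketch below states that relation and checks it against the 'interior' row of the table. The helper names are our own; only the formula and the table values come from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall(tp, fp, fn):
    """Per-class precision and recall from true-positive, false-positive
    and false-negative counts of a token classifier."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Sanity check against the 'interior' row of Table 3: precision 0.4673
# and recall 0.8696 combine to an F1 of roughly 0.6079, as reported.
interior_f1 = f1_score(0.4673, 0.8696)
print(round(interior_f1, 4))
```

The check also explains the pattern in Table 3: recall is high for every NNSE class, so the modest F1 scores are driven almost entirely by precision.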
Conversely, the most common error for all types of spatial entities was their misclassification as class O. Notably, the differentiation between the various spatial entities was generally accurate, except for the 'urban' class, which was occasionally misclassified as 'rural'.

Figure 3: Confusion matrix for all five classes. All classes are most often classified correctly as themselves; the biggest error is NNSE tokens not being recognised as NNSEs at all. The biggest error after that is 'urban' being misclassified as 'rural', with a 13% error rate.

Figure 3 displays a confusion matrix for all classes. The highest error rate for all four NNSE types involves their misclassification as class O, ranging from 6.9% for 'interior' to 14% for 'rural'. The next biggest error is 'urban' being misclassified as 'rural', with a 13% error rate.

4. Conclusion and further research

In this work, we have developed a tool that, already in its present state, facilitates valuable quantitative spatial research on 19th- and early 20th-century German-language literary corpora. While out-of-the-box solutions typically only provide Named Entity Recognition (NER) models, to the best of the authors' knowledge, a classification of non-named spatial entities as conducted here, assigning each entity to one of several types, has not previously been published for German. Schumacher, Flüh, and Nantke [14] developed a classifier based on Conditional Random Fields (CRF) which includes several more categories than our classifier, but it can only detect places, not their types. Bamman's BookNLP is able to differentiate between locations, which align with our 'natural' type, and facilities, which cover our 'urban' and 'rural' spatial entities. For comparison, we also tested the large language model Llama 3 7b [1] with a few-shot prompt for recognising non-named spatial entities (NNSEs).
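The comparison with the LLM baseline reports 'partial' and 'perfect' match rates against the manual annotation. The paper does not spell out the exact matching criteria, so the sketch below is one plausible operationalisation (perfect = identical label sequence; partial = at least one gold NNSE token predicted with the same label); the function name and criteria are our own illustrative choices.

```python
def match_type(gold, pred):
    """Compare a predicted token-label sequence against the gold one.
    Returns 'perfect' if all labels agree, 'partial' if at least one
    gold NNSE token (label != 'O') is predicted with the same label,
    and 'none' otherwise. Illustrative criteria, not the paper's own."""
    assert len(gold) == len(pred)
    if gold == pred:
        return "perfect"
    hit = any(g == p != "O" for g, p in zip(gold, pred))
    return "partial" if hit else "none"

# Toy gold annotation and three hypothetical model outputs.
gold = ["O", "natural", "O", "rural"]
assert match_type(gold, ["O", "natural", "O", "rural"]) == "perfect"
assert match_type(gold, ["O", "natural", "O", "O"]) == "partial"
assert match_type(gold, ["O", "O", "urban", "O"]) == "none"
```

Aggregating these outcomes over a test set yields the kind of partial-match and perfect-match percentages reported for the LLM comparison.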
The automatic classification of the test dataset with Llama 3 resulted in only a 5.6% partial match with the manual annotation, and a 0.7% perfect match. The main issue was the model's tendency to hallucinate new NNSEs when attempting to continue a sentence, contrary to instructions.

The high performance in distinguishing spatial entities from non-spatial tokens is unsurprising, as this was the least contentious aspect during the evaluation of the annotation process. The high error rate of 'urban' being misclassified as 'rural', but not vice versa, can be explained by the prevalence of 'rural' space in the training data. Additionally, the boundary between 'rural' and 'urban', as described in the guidelines, is 'fuzzier' than the respective distinctions from 'interior' and 'natural'. This fuzziness may be aggravated by the inherent ambiguity of using sentences as training units.

This classifier is considered a work in progress, as it has currently been trained exclusively on Swiss-German texts from the late 19th to early 20th century. Potential improvements include gathering more training data and adapting the gbert-large model to Swiss-German literary texts from the long 19th century and beyond, as well as remodelling the categories to include interior-urban and interior-rural. We plan to use this classifier to explore the remainder of the Swiss-German novel corpus built by Herrmann and Grisot [8], qualitatively examining patterns in the representation of space, with a particular focus on interior items. Subsequently, we intend to extend our research to a broader corpus of German literature. Methodologically, we plan to evaluate the use of synthetic training data produced by generative AI to enhance our model. One key aspect of space that will be analysed in depth in future research is the relationship between affect and space in literature, building upon the previous work of Grisot and Herrmann [7].

References

[1] AI@Meta. Llama 3 Model Card. 2024. url: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
[2] F. Barth. "Konzept und Klassifikation literarischer Raumentitäten". In: Informatik 2020 (2021), pp. 1281–1293. doi: 10.18420/inf2020_120.
[3] S. Bushell, J. O. Butler, D. Hay, and R. Hutcheon. "Digital Literary Mapping: I. Visualizing and Reading Graph Topologies as Maps for Literature". In: Cartographica: The International Journal for Geographic Information and Geovisualization 57.1 (2022), pp. 11–36. doi: 10.3138/cart-2021-0008.
[4] B. Chan, S. Schweter, and T. Möller. German's Next Language Model. 2020. arXiv: 2010.10906 [cs]. url: http://arxiv.org/abs/2010.10906. Pre-published.
[5] N. Dekker, T. Kuhn, and M. Van Erp. "Evaluating Named Entity Recognition Tools for Extracting Social Networks from Novels". In: PeerJ Computer Science 5 (2019), e189. doi: 10.7717/peerj-cs.189.
[6] A. Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Second edition. Beijing, Boston, Farnham, Sebastopol, Tokyo: O'Reilly, 2019. 819 pp.
[7] G. Grisot and B. Herrmann. "Examining the Representation of Landscape and Its Emotional Value in German-Swiss Fiction between 1840 and 1940". In: Journal of Cultural Analytics 8.1 (2023). doi: 10.22148/001c.84475.
[8] G. Grisot and B. Herrmann. Swiss German Novel Collection (ELTeC-gsw). Version 2.0.0. Zenodo, 2021. doi: 10.5281/zenodo.4584544.
[9] W. Hallet and B. Neumann. Raum und Bewegung in der Literatur: Die Literaturwissenschaften und der Spatial Turn. Bielefeld: transcript Verlag, 2015.
[10] E. W. B. Hess-Lüttich. "Spatial Turn: On the Concept of Space in Cultural Geography and Literary Theory". In: Meta-Carto-Semiotics 5.1 (2017), pp. 27–37.
[11] D. Jurafsky and J. H. Martin. Speech and Language Processing. 2024.
[12] I. Montani and M. Honnibal. Prodigy: A Modern and Scriptable Annotation Tool for Creating Training Data for Machine Learning Models. Explosion, 2018. url: https://prodi.gy/.
[13] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Curran Associates, Inc., 2019, pp. 8024–8035.
[14] M. Schumacher, M. Flüh, and J. Nantke. "Place and Space in Literature: Named Entity Recognition as a Possibility for Spatial Modelling in Computational Literary Studies". In: Geographical Research in the Digital Humanities. Ed. by F. Dammann and D. Kremer. transcript Verlag, 2024, pp. 83–112. doi: 10.14361/9783839469187-006.
[15] M. K. Schumacher. Orte und Räume in Romanen. Berlin, Heidelberg: Springer Berlin Heidelberg, 2023. 234 pp. doi: 10.1007/978-3-662-66035-5_1.
[16] M. Short. "Text Mining and Subject Analysis for Fiction; or, Using Machine Learning and Information Extraction to Assign Subject Headings to Dime Novels". In: Cataloging & Classification Quarterly 57.5 (2019), pp. 315–336. doi: 10.1080/01639374.2019.1653413.
[17] R. T. Tally, ed. Geocritical Explorations: Space, Place, and Mapping in Literary and Cultural Studies. New York: Palgrave Macmillan US, 2011. doi: 10.1057/9780230337930.
[18] R. T. Tally. "The Space of the Novel". In: The Cambridge Companion to the Novel. Ed. by E. Bulson. 1st ed. Cambridge University Press, 2018, pp. 152–167. doi: 10.1017/9781316659694.011.
[19] R. T. Tally. Topophrenia: Place, Narrative, and the Spatial Imagination. Indiana University Press, 2019. doi: 10.2307/j.ctv7r40df. JSTOR: 10.2307/j.ctv7r40df.
[20] H. Vala, D. Jurgens, A. Piper, and D. Ruths. "Mr. Bennet, His Coachman, and the Archbishop Walk into a Bar but Only One of Them Gets Recognized: On the Difficulty of Detecting Characters in Literary Texts". In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, 2015, pp. 769–774. doi: 10.18653/v1/D15-1088.
[21] K. Van Dalen-Oskam. "Names in Novels: An Experiment in Computational Stylistics". In: Literary and Linguistic Computing 28.2 (2013), pp. 359–370. doi: 10.1093/llc/fqs007.
[22] K. Van Dalen-Oskam, M. Marx, I. Sijaranamual, K. Depuydt, B. Verheij, and V. Geirnaert. "Named Entity Recognition and Resolution for Literary Studies". In: Computational Linguistics in the Netherlands Journal 4 (2014), pp. 121–136. url: http://www.clinjournal.org/node/62.
[23] G. Viehhauser. "Zur Erkennung von Raum in narrativen Texten: Spatial Frames und Raumsemantik als Modelle für eine digitale Narratologie des Raums". In: Reflektierte Algorithmische Textanalyse. Ed. by N. Reiter, A. Pichler, and J. Kuhn. De Gruyter, 2020, pp. 373–388. doi: 10.1515/9783110693973-015.
[24] G. Viehhauser-Mery and F. Barth. "Towards a Digital Narratology of Space". In: Book of Abstracts DH2017. Montreal, 2017. url: https://dh2017.adho.org/abstracts/413/413.pdf.
[25] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush. HuggingFace's Transformers: State-of-the-art Natural Language Processing. Version 5. 2019. doi: 10.48550/arxiv.1910.03771. Pre-published.